AML2019

Challenge 3

Anomaly Detection (AD)


3rd May 2019

Anomaly detection (AD) refers to the process of detecting data points that do not conform to the rest of the observations. Applications of anomaly detection include fraud and fault detection, surveillance, diagnosis, data cleanup, and predictive maintenance.

When we talk about AD, we usually look at it as an unsupervised (or semi-supervised) task, where the concept of anomaly is often not well defined or, in the best case, just a few samples are labeled as anomalous. In this challenge, we will look at AD from a different perspective!

The dataset we are going to work on consists of monitoring data generated by IT systems; such data is then processed by a monitoring system that executes some checks and detects a series of anomalies. This is a multi-label classification problem, where each check is a binary label corresponding to a specific type of anomaly. Our goal is to develop a machine learning model (or multiple ones) to accurately detect such anomalies.

This will involve a mixture of data exploration, pre-processing, model selection, and performance evaluation. We will also try one rule learning model and compare it with other ML models in terms of both predictive performance and interpretability. Interpretability is indeed a strong requirement, especially in applications like AD, where understanding the output of a model is as important as the output itself.

Dataset Description


* Location of the Dataset on zoe

The data for this challenge is located at: /mnt/datasets/anomaly

* Files

You have a single CSV file with 36 features and 8 labels. Each record contains aggregate features computed over a given amount of time.

* Attributes

A brief outline of the available attributes is given below.

  1. SessionNumber (INTEGER): it identifies the session on which data is collected;
  2. SystemID (INTEGER): it identifies the system generating the data;
  3. Date (DATE): collection date;
  4. HighPriorityAlerts (INTEGER [0, N]): number of high priority alerts in the session;
  5. Dumps (INTEGER [0, N]): number of memory dumps;
  6. CleanupOOMDumps (INTEGER [0, N]): number of cleanup OOM dumps;
  7. CompositeOOMDums (INTEGER [0, N]): number of composite OOM dumps;
  8. IndexServerRestarts (INTEGER [0, N]): number of restarts of the index server;
  9. NameServerRestarts (INTEGER [0, N]): number of restarts of the name server;
  10. XSEngineRestarts (INTEGER [0, N]): number of restarts of the XSEngine;
  11. PreprocessorRestarts (INTEGER [0, N]): number of restarts of the preprocessor;
  12. DaemonRestarts (INTEGER [0, N]): number of restarts of the daemon process;
  13. StatisticsServerRestarts (INTEGER [0, N]): number of restarts of the statistics server;
  14. CPU (FLOAT [0, 100]): cpu usage;
  15. PhysMEM (FLOAT [0, 100]): physical memory;
  16. InstanceMEM (FLOAT [0, 100]): memory usage of one instance of the system;
  17. TablesAllocation (FLOAT [0, 100]): memory allocated for tables;
  18. IndexServerAllocationLimit (FLOAT [0, 100]): level of memory used by index server;
  19. ColumnUnloads (INTEGER [0, N]): number of columns unloaded from the tables;
  20. DeltaSize (INTEGER [0, N]): size of the delta store;
  21. MergeErrors (BOOLEAN [0, 1]): 1 if there are merge errors;
  22. BlockingPhaseSec (INTEGER [0, N]): blocking phase duration in seconds;
  23. Disk (FLOAT [0, 100]): disk usage;
  24. LargestTableSize (INTEGER [0, N]): size of the largest table;
  25. LargestPartitionSize (INTEGER [0, N]): size of the largest partition of a table;
  26. DiagnosisFiles (INTEGER [0, N]): number of diagnosis files;
  27. DiagnosisFilesSize (INTEGER [0, N]): size of diagnosis files;
  28. DaysWithSuccessfulDataBackups (INTEGER [0, N]): number of days with successful data backups;
  29. DaysWithSuccessfulLogBackups (INTEGER [0, N]): number of days with successful log backups;
  30. DaysWithFailedDataBackups (INTEGER [0, N]): number of days with failed data backups;
  31. DaysWithFailedfulLogBackups (INTEGER [0, N]): number of days with failed log backups;
  32. MinDailyNumberOfSuccessfulDataBackups (INTEGER [0, N]): minimum number of successful data backups per day;
  33. MinDailyNumberOfSuccessfulLogBackups (INTEGER [0, N]): minimum number of successful log backups per day;
  34. MaxDailyNumberOfFailedDataBackups (INTEGER [0, N]): maximum number of failed data backups per day;
  35. MaxDailyNumberOfFailedLogBackups (INTEGER [0, N]): maximum number of failed log backups per day;
  36. LogSegmentChange (INTEGER [0, N]): changes in the number of log segments.

* Labels

Labels are binary. Each label refers to a different anomaly.

  • Check1;
  • Check2;
  • Check3;
  • Check4;
  • Check5;
  • Check6;
  • Check7;
  • Check8;

Data Exploration


The first step in building a model is to understand the data. In this section we will load, visualize and explore the given data.

1. First glance at the data

First, we import the necessary packages:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

# Display all the columns
pd.options.display.max_columns = None

A quick check of the given data shows that it is a CSV file without headers. To make the analysis easier, we define the column names as below:

In [2]:
# Base path of the data file
base = "/mnt/datasets/anomaly"

# Column names, in the same order as in the data source
features = np.array([
    "SessionNumber",
    "SystemID",
    "Date",
    "HighPriorityAlerts",
    "Dumps",
    "CleanupOOMDumps",
    "CompositeOOMDums",
    "IndexServerRestarts",
    "NameServerRestarts",
    "XSEngineRestarts",
    "PreprocessorRestarts",
    "DaemonRestarts",
    "StatisticsServerRestarts",
    "CPU",
    "PhysMEM",
    "InstanceMEM",
    "TablesAllocation",
    "IndexServerAllocationLimit",
    "ColumnUnloads",
    "DeltaSize",
    "MergeErrors",
    "BlockingPhaseSec",
    "Disk",
    "LargestTableSize",
    "LargestPartitionSize",
    "DiagnosisFiles",
    "DiagnosisFilesSize",
    "DaysWithSuccessfulDataBackups",
    "DaysWithSuccessfulLogBackups",
    "DaysWithFailedDataBackups",
    "DaysWithFailedfulLogBackups",
    "MinDailyNumberOfSuccessfulDataBackups",
    "MinDailyNumberOfSuccessfulLogBackups",
    "MaxDailyNumberOfFailedDataBackups",
    "MaxDailyNumberOfFailedLogBackups",
    "LogSegmentChange",
])

# List of anomaly types
labels = np.array([
    "Check1",
    "Check2",
    "Check3",
    "Check4",
    "Check5",
    "Check6",
    "Check7",
    "Check8"])
In [3]:
# load data using predefined headers and character ; as the delimiter
data = pd.read_csv(base + '/data.csv', sep = ';', header=None, names = np.append(features, labels))
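
As a minimal, self-contained sketch of the loading step above, the snippet below reads a small in-memory sample instead of the real file. The two-row sample and the shortened column list are invented for illustration; only the `sep`, `header` and `names` arguments mirror the call above.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same style as the data file:
# no header line, fields separated by ';', an empty field is a missing value.
sample = io.StringIO("0;0;16/01/2017 04:04;1;0.0\n1;1;06/02/2017 04:03;0;\n")

# Shortened, illustrative column list (the real file has 44 columns).
toy_columns = ["SessionNumber", "SystemID", "Date", "HighPriorityAlerts", "Dumps"]

# header=None tells pandas that the first line is data, not column names.
toy = pd.read_csv(sample, sep=";", header=None, names=toy_columns)
print(toy.shape)
print(toy["Dumps"].isna().sum())
```

Without `header=None`, pandas would consume the first record as the header row, so one log would silently be lost.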

Check the shape of our data, columns information and show its first 10 records:

In [4]:
# Display the first 10 record
print ("\nDisplay the first 10 record")
display(data.head(n=10))

# Display the number of entries, columns, its corresponding name and dtype
print ("\nDisplay the data information")
data.info()
Display the first 10 record
SessionNumber SystemID Date HighPriorityAlerts Dumps CleanupOOMDumps CompositeOOMDums IndexServerRestarts NameServerRestarts XSEngineRestarts PreprocessorRestarts DaemonRestarts StatisticsServerRestarts CPU PhysMEM InstanceMEM TablesAllocation IndexServerAllocationLimit ColumnUnloads DeltaSize MergeErrors BlockingPhaseSec Disk LargestTableSize LargestPartitionSize DiagnosisFiles DiagnosisFilesSize DaysWithSuccessfulDataBackups DaysWithSuccessfulLogBackups DaysWithFailedDataBackups DaysWithFailedfulLogBackups MinDailyNumberOfSuccessfulDataBackups MinDailyNumberOfSuccessfulLogBackups MaxDailyNumberOfFailedDataBackups MaxDailyNumberOfFailedLogBackups LogSegmentChange Check1 Check2 Check3 Check4 Check5 Check6 Check7 Check8
0 0 0 16/01/2017 04:04 1 0.0 0.0 0.0 0 0 0 0 0 0 4.77 61.86 37.48 0.0 NaN 0 52884993.0 0.0 NaN 65.69 606600.0 6804.0 79.0 444366335.0 7 8 0 0 1 32 0 0 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 0.0
1 1 1 06/02/2017 04:03 0 0.0 0.0 0.0 0 0 0 0 0 0 1.05 32.82 12.77 0.0 NaN 0 65546255.0 0.0 NaN 45.60 1818555.0 6804.0 54.0 227400051.0 3 8 0 0 1 32 0 0 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 0.0
2 2 1 20/02/2017 04:03 0 0.0 0.0 0.0 0 0 0 0 0 0 0.66 35.16 13.00 0.0 NaN 0 59582212.0 0.0 NaN 18.94 1818505.0 6804.0 54.0 234913753.0 3 8 0 0 1 32 0 0 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 0.0
3 3 2 13/02/2017 04:44 1 0.0 0.0 0.0 0 0 0 0 0 0 3.17 82.93 52.94 0.0 NaN 0 48229160.0 0.0 NaN 40.29 695934.0 6804.0 91.0 511053878.0 7 8 0 0 1 38 0 0 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 0.0
4 4 3 06/02/2017 04:31 1 0.0 0.0 0.0 0 0 0 0 0 0 2.92 76.18 20.51 0.0 NaN 0 79452443.0 0.0 NaN 49.83 959031.0 6804.0 55.0 172953445.0 7 8 0 0 1 5 0 0 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 0.0
5 5 4 06/02/2017 04:33 1 0.0 0.0 0.0 0 0 0 0 0 0 3.40 85.44 82.05 0.0 NaN 0 57984723.0 0.0 NaN 43.51 731716.0 6804.0 61.0 229332452.0 1 8 0 0 1 36 0 0 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 0.0
6 6 4 13/02/2017 04:33 1 0.0 0.0 0.0 0 0 0 0 0 0 15.44 85.62 82.46 0.0 NaN 0 59368661.0 0.0 NaN 43.51 759096.0 6804.0 63.0 246349797.0 1 8 0 0 1 36 0 0 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 0.0
7 7 0 13/02/2017 04:05 0 0.0 0.0 0.0 0 0 0 0 0 0 4.83 64.28 42.09 0.0 NaN 0 53573181.0 0.0 NaN 65.73 606600.0 6804.0 79.0 456053276.0 7 8 0 0 1 32 0 0 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 0.0
8 8 5 13/02/2017 04:01 0 0.0 0.0 0.0 0 0 0 0 0 0 5.42 85.49 59.80 0.0 NaN 0 41573532.0 0.0 NaN 66.06 606600.0 6804.0 97.0 851688877.0 7 8 0 0 1 32 0 0 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 0.0
9 9 6 06/02/2017 04:10 2 0.0 0.0 0.0 0 0 0 0 0 0 34.93 90.47 75.34 0.0 NaN 0 86330743.0 0.0 NaN 54.21 1458388.0 6804.0 72.0 432584447.0 7 8 0 0 1 34 0 0 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 0.0
Display the data information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 287031 entries, 0 to 287030
Data columns (total 44 columns):
SessionNumber                            287031 non-null int64
SystemID                                 287031 non-null int64
Date                                     287031 non-null object
HighPriorityAlerts                       287031 non-null int64
Dumps                                    287028 non-null float64
CleanupOOMDumps                          287028 non-null float64
CompositeOOMDums                         287028 non-null float64
IndexServerRestarts                      287031 non-null int64
NameServerRestarts                       287031 non-null int64
XSEngineRestarts                         287031 non-null int64
PreprocessorRestarts                     287031 non-null int64
DaemonRestarts                           287031 non-null int64
StatisticsServerRestarts                 287031 non-null int64
CPU                                      261822 non-null float64
PhysMEM                                  266464 non-null float64
InstanceMEM                              264914 non-null float64
TablesAllocation                         284741 non-null float64
IndexServerAllocationLimit               260587 non-null float64
ColumnUnloads                            287031 non-null int64
DeltaSize                                286825 non-null float64
MergeErrors                              279298 non-null float64
BlockingPhaseSec                         211177 non-null float64
Disk                                     275652 non-null float64
LargestTableSize                         270781 non-null float64
LargestPartitionSize                     286881 non-null float64
DiagnosisFiles                           265108 non-null float64
DiagnosisFilesSize                       265108 non-null float64
DaysWithSuccessfulDataBackups            287031 non-null int64
DaysWithSuccessfulLogBackups             287031 non-null int64
DaysWithFailedDataBackups                287031 non-null int64
DaysWithFailedfulLogBackups              287031 non-null int64
MinDailyNumberOfSuccessfulDataBackups    287031 non-null int64
MinDailyNumberOfSuccessfulLogBackups     287031 non-null int64
MaxDailyNumberOfFailedDataBackups        287031 non-null int64
MaxDailyNumberOfFailedLogBackups         287031 non-null int64
LogSegmentChange                         251482 non-null float64
Check1                                   262520 non-null float64
Check2                                   262545 non-null float64
Check3                                   264463 non-null float64
Check4                                   250384 non-null float64
Check5                                   251997 non-null float64
Check6                                   279647 non-null float64
Check7                                   251309 non-null float64
Check8                                   286979 non-null float64
dtypes: float64(25), int64(18), object(1)
memory usage: 96.4+ MB

Our data contains 287,031 logs with 36 features and 8 types of anomaly. All columns are numeric except Date, which is read as a string (dtype object).
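
The non-null counts printed by `info()` differ from column to column, so several features and labels contain missing values; BlockingPhaseSec, for instance, has only 211,177 non-null entries out of 287,031. One possible way to rank the columns by their share of missing values is sketched below on a toy frame. The values are invented and the variable names `toy` and `missing_share` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; column names are borrowed from the dataset.
toy = pd.DataFrame({
    "CPU": [4.77, np.nan, 0.66, 3.17],
    "BlockingPhaseSec": [np.nan, np.nan, 12.0, np.nan],
    "Check1": [0.0, 0.0, np.nan, 1.0],
})

# isna() gives a boolean frame; the column mean is the fraction of missing values.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```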

The data description shows that some features are constrained in their values; for example, the number of memory dumps should be a non-negative integer. We will check whether there are any out-of-range values in our dataset.
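
One way to make such a check explicit is to write the expected ranges down once and count the violations per column. The sketch below is only an illustration on invented values: the helper `count_out_of_range` and the dictionary `expected_ranges` are hypothetical names, and only three of the documented ranges are included.

```python
import numpy as np
import pandas as pd

# Expected closed ranges taken from the attribute list; None means "no upper bound".
expected_ranges = {
    "CPU": (0, 100),
    "Disk": (0, 100),
    "Dumps": (0, None),
}

def count_out_of_range(df, ranges):
    """Return the number of non-missing values outside the expected range, per column."""
    counts = {}
    for col, (low, high) in ranges.items():
        values = df[col].dropna()
        bad = values < low
        if high is not None:
            bad |= values > high
        counts[col] = int(bad.sum())
    return pd.Series(counts)

# Toy data with invented values.
toy = pd.DataFrame({
    "CPU": [4.77, 4602.49, np.nan],
    "Disk": [65.69, 45.60, 18.94],
    "Dumps": [0.0, -1.0, 3.0],
})
out_of_range = count_out_of_range(toy, expected_ranges)
print(out_of_range)
```

Missing values are dropped before the comparison, so they are counted neither as valid nor as out of range.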

First we convert the Date column to a datetime type:

In [5]:
#Handle the date 
data['Date'] = pd.to_datetime(data['Date'], format = "%d/%m/%Y %H:%M")

print("Date type: ", data.Date.dtype)
Date type:  datetime64[ns]
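
The explicit `format` argument matters here because the timestamps are written day first: a string such as 06/02/2017 denotes 6 February, whereas a month-first reading would turn it into 2 June. A small sketch on two timestamps taken from the first records shown earlier (the variable names are illustrative):

```python
import pandas as pd

raw = pd.Series(["06/02/2017 04:03", "16/01/2017 04:04"])

# Explicit day-first format, as in the cell above.
parsed = pd.to_datetime(raw, format="%d/%m/%Y %H:%M")

print(parsed.dt.month.tolist())  # month numbers
print(parsed.dt.day.tolist())    # day of month
```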

Show the values range for each feature/label:

In [6]:
# Show the value range for each feature and label
for f in np.append(features, labels):
    # Ignore NaN values
    f_values = data[f].dropna()
    print("%s: [%s , %s] with %s unique values " % (f, min(f_values), max(f_values), f_values.unique().size))
SessionNumber: [0 , 228195] with 228196 unique values 
SystemID: [0 , 3187] with 3188 unique values 
Date: [2017-01-06 13:42:00 , 2018-01-24 00:34:00] with 55791 unique values 
HighPriorityAlerts: [0 , 24] with 25 unique values 
Dumps: [0.0 , 1429.0] with 168 unique values 
CleanupOOMDumps: [0.0 , 0.0] with 1 unique values 
CompositeOOMDums: [0.0 , 280.0] with 71 unique values 
IndexServerRestarts: [0 , 341] with 87 unique values 
NameServerRestarts: [0 , 159] with 69 unique values 
XSEngineRestarts: [0 , 150] with 63 unique values 
PreprocessorRestarts: [0 , 0] with 1 unique values 
DaemonRestarts: [0 , 0] with 1 unique values 
StatisticsServerRestarts: [0 , 9] with 9 unique values 
CPU: [0.12 , 4602.49] with 8777 unique values 
PhysMEM: [2.02 , 2070680.21] with 9596 unique values 
InstanceMEM: [0.01 , 99.15] with 9547 unique values 
TablesAllocation: [0.0 , 94.6] with 8262 unique values 
IndexServerAllocationLimit: [0.23 , 99.55] with 9885 unique values 
ColumnUnloads: [0 , 1192000] with 25506 unique values 
DeltaSize: [0.0 , 1280000000000.0] with 283833 unique values 
MergeErrors: [0.0 , 1.0] with 2 unique values 
BlockingPhaseSec: [0.0 , 18761111.0] with 10186 unique values 
Disk: [0.21 , 373103799682844.5] with 10061 unique values 
LargestTableSize: [0.0 , 2147483645.0] with 170964 unique values 
LargestPartitionSize: [0.0 , 2147483645.0] with 171934 unique values 
DiagnosisFiles: [3.0 , 356265.0] with 3102 unique values 
DiagnosisFilesSize: [1514934.0 , 1730000000000.0] with 262457 unique values 
DaysWithSuccessfulDataBackups: [0 , 14] with 15 unique values 
DaysWithSuccessfulLogBackups: [0 , 22] with 16 unique values 
DaysWithFailedDataBackups: [0 , 13] with 14 unique values 
DaysWithFailedfulLogBackups: [0 , 14] with 15 unique values 
MinDailyNumberOfSuccessfulDataBackups: [0 , 17] with 17 unique values 
MinDailyNumberOfSuccessfulLogBackups: [0 , 7156] with 1321 unique values 
MaxDailyNumberOfFailedDataBackups: [0 , 63] with 24 unique values 
MaxDailyNumberOfFailedLogBackups: [0 , 66017] with 3431 unique values 
LogSegmentChange: [-16887.0 , 10084.0] with 1328 unique values 
Check1: [0.0 , 1.0] with 2 unique values 
Check2: [0.0 , 1.0] with 2 unique values 
Check3: [0.0 , 1.0] with 2 unique values 
Check4: [0.0 , 1.0] with 2 unique values 
Check5: [0.0 , 1.0] with 2 unique values 
Check6: [0.0 , 1.0] with 2 unique values 
Check7: [0.0 , 1.0] with 2 unique values 
Check8: [0.0 , 1.0] with 2 unique values 

We compared this information with the data description and made the following observations:

  • The logs were collected from 3,188 systems over a little more than one year, from 06 Jan 2017 to 24 Jan 2018.
  • SessionNumber works like an identifier for the records. It should not play an important role in anomaly detection.
  • All logs have zero cleanup OOM dumps, so CleanupOOMDumps carries no information for model building.
  • All logs have zero preprocessor restarts, so PreprocessorRestarts carries no information for model building.
  • All logs have zero daemon restarts, so DaemonRestarts carries no information for model building.
  • CPU: [0.12 , 4602.49] with 8777 unique values, whereas the data description gives the range [0, 100].
  • PhysMEM: [2.02 , 2070680.21] with 9596 unique values, whereas the data description gives the range [0, 100].
  • Disk: [0.21 , 373103799682844.5], whereas the data description gives the range [0, 100].
  • LogSegmentChange: [-16887.0 , 10084.0], whereas the data description gives the non-negative range [0, N].
  • All anomaly labels have valid values (0 means normal, 1 means anomaly).
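
Features that take a single value, such as CleanupOOMDumps, PreprocessorRestarts and DaemonRestarts above, can also be found programmatically. The helper below is a hypothetical sketch, shown on invented values, that lists the columns with at most one distinct non-missing value.

```python
import numpy as np
import pandas as pd

def constant_columns(df):
    """Names of columns with at most one distinct non-missing value."""
    n_unique = df.nunique(dropna=True)
    return n_unique[n_unique <= 1].index.tolist()

# Toy data with invented values; column names are borrowed from the dataset.
toy = pd.DataFrame({
    "CleanupOOMDumps": [0.0, 0.0, np.nan, 0.0],
    "DaemonRestarts": [0, 0, 0, 0],
    "HighPriorityAlerts": [1, 0, 0, 2],
})
print(constant_columns(toy))
```

Columns found this way are natural candidates for removal before modelling, since they cannot separate any two records.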

First we look at the features with out-of-range values: CPU, PhysMEM and Disk. All of them should lie in [0, 100], yet some of their values are far larger than 100. These metrics are sensitive to system behavior; for example, very high memory usage could indicate an anomaly or an error. Let's check the distribution of these features and see whether the out-of-range records are anomalies.

In [7]:
tmp_features = np.array(["CPU", "PhysMEM", "Disk"])

for f in tmp_features:
    # Data without NaN values in feature f
    df = data.dropna(subset = [f])
    # Data with f value <= 100
    in_range_records = df.loc[df[f] <= 100]
    # Data with f value > 100
    out_of_range_records = df.loc[df[f] > 100]
    print("%s records with %s smaller than 100 " % (in_range_records.shape[0], f))
    print("%s records with %s larger than 100 " % (out_of_range_records.shape[0], f))
    # Get the labels of data with f > 100
    out_of_range_records = out_of_range_records.iloc[:,36:44].fillna(0)
    print("%s anomaly detections with %s larger than 100 " % (out_of_range_records.max(axis = 1).sum(), f))
    print("============================")

print("Box plot for these features:")
f = pd.melt(data, value_vars=tmp_features)
g = sns.FacetGrid(f, col="variable",sharex=False, sharey=False)
g = g.map(sns.boxplot, "value")
261813 records with CPU smaller than 100 
9 records with CPU larger than 100 
9.0 anomaly detections with CPU larger than 100 
============================
266433 records with PhysMEM smaller than 100 
31 records with PhysMEM larger than 100 
31.0 anomaly detections with PhysMEM larger than 100 
============================
275464 records with Disk smaller than 100 
188 records with Disk larger than 100 
80.0 anomaly detections with Disk larger than 100 
============================
Box plot for these features:

As expected, all out-of-range CPU and PhysMEM records are anomalies, while for Disk the share is about 43% (80 of 188).
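
The shares quoted above follow from a row-wise "any anomaly" flag: a record counts as anomalous if at least one check fired. A minimal sketch of that computation on invented values, with a shortened, illustrative label list and the hypothetical column name `AnyAnomaly`:

```python
import numpy as np
import pandas as pd

label_cols = ["Check1", "Check2"]  # shortened, illustrative label list

# Toy data with invented values.
toy = pd.DataFrame({
    "Disk": [120.0, 250.0, 45.6, 300.0, np.nan],
    "Check1": [1.0, 0.0, 0.0, np.nan, 1.0],
    "Check2": [0.0, 0.0, 1.0, 1.0, 0.0],
})

# A record is flagged if at least one check fired; missing labels count as 0,
# which is the same convention as the fillna(0) in the cell above.
toy["AnyAnomaly"] = toy[label_cols].fillna(0).max(axis=1)

# Comparisons with NaN are False, so records with a missing Disk value are excluded.
out_of_range = toy[toy["Disk"] > 100]
share = out_of_range["AnyAnomaly"].mean()
print(f"{share:.1%} of out-of-range Disk records are flagged")
```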

Now let's check the last feature with out-of-range values: LogSegmentChange

In [8]:
tmp_features = np.array(["LogSegmentChange"])

for f in tmp_features:
    # Data without NaN values in feature f
    df = data.dropna(subset = [f])
    # Data with f value >= 0
    in_range_records = df.loc[df[f] >= 0]
    # Data with f value < 0
    out_of_range_records = df.loc[df[f] < 0]
    print("%s records with %s larger than 0 " % (in_range_records.shape[0], f))
    print("%s records with %s smaller than 0 " % (out_of_range_records.shape[0], f))
    # Get the labels of data with f >= 0
    in_range_records = in_range_records.iloc[:,36:44].fillna(0)
    print("%s anomaly detections with %s larger than 0 " % (in_range_records.max(axis = 1).sum(), f))
    # Get the labels of data with f < 0
    out_of_range_records = out_of_range_records.iloc[:,36:44].fillna(0)
    print("%s anomaly detections with %s smaller than 0 " % (out_of_range_records.max(axis = 1).sum(), f))
    print("============================")

print("Distribution plot for these features:")
f = pd.melt(data, value_vars=tmp_features)
g = sns.FacetGrid(f, col="variable",sharex=False, sharey=False)
g = g.map(sns.distplot, "value")
245812 records with LogSegmentChange larger than 0 
5670 records with LogSegmentChange smaller than 0 
87759.0 anomaly detections with LogSegmentChange larger than 0 
3040.0 anomaly detections with LogSegmentChange smaller than 0 
============================
Distribution plot for these features:

These counts show that more than 50% of the records with negative LogSegmentChange are anomalies (3,040 of 5,670, about 54%). The corresponding share for non-negative LogSegmentChange is about 36% (87,759 of 245,812).
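
The same comparison can be written as a single group-by over the sign of the feature. The sketch below uses invented values and assumes a precomputed 0/1 column, here called `AnyAnomaly`, that marks records with at least one detection.

```python
import numpy as np
import pandas as pd

# Toy data with invented values; "AnyAnomaly" is 1 if any check fired.
toy = pd.DataFrame({
    "LogSegmentChange": [-5.0, -1.0, 0.0, 3.0, 8.0, np.nan],
    "AnyAnomaly": [1.0, 0.0, 0.0, 1.0, 0.0, 1.0],
})

# Records with a missing feature value belong to neither group.
known = toy.dropna(subset=["LogSegmentChange"])
sign = np.where(known["LogSegmentChange"] < 0, "negative", "non-negative")

# The mean of a 0/1 column is the fraction of flagged records in each group.
summary = known.groupby(sign)["AnyAnomaly"].agg(["count", "sum", "mean"])
print(summary)
```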

2. Labels Overview

In this section we look at the types of anomalies. First let's check the number of detections for each type.

In [9]:
plt.figure(figsize=(15,4))
ax = sns.barplot(x=labels, y=data.iloc[:,36:].sum().values)
plt.title("Detections in each anomaly category")
plt.ylabel('Number of detections')
plt.xlabel('Anomaly Types')

#adding the text labels
rects = ax.patches
text_labels = data.iloc[:,36:].sum().values
for rect, label in zip(rects, text_labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center', va='bottom')
plt.show()

We see that Check6 has the most detections, with 80,572 logs, while the second most frequent is Check4, with 24,815 detections. Check2, Check3 and Check7 are close to each other, with around 7,500 to 8,500 detections each, as are Check5 and Check8, with around 3,000 logs each. Check1 occurs least often, with 1,636 anomalies.

Next we check whether there are logs with more than one type of anomaly detected:

In [10]:
rowSums = data.iloc[:,36:].sum(axis=1)

multiLabel_counts = rowSums.value_counts()
multiLabel_counts = multiLabel_counts.iloc[1:]

plt.figure(figsize=(15,4))
ax = sns.barplot(x=multiLabel_counts.index, y=multiLabel_counts.values)
plt.title("Multi-anomaly detections")
plt.ylabel('Number of detections')
plt.xlabel('Number of anomalies')

#adding the text labels
rects = ax.patches
text_labels = multiLabel_counts.values
for rect, label in zip(rects, text_labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center', va='bottom')
plt.show()

There are 78,435 single-anomaly detections and around 22,000 multi-anomaly detections, which combine 2 to 7 types of anomaly. Our problem is therefore a multi-label problem. Next we check whether there is any correlation between the labels.
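
The counts above come from summing the labels of each record. The cell drops the "no anomaly" group by position (`iloc[1:]`), which relies on that group being the most frequent one; filtering by value, as in the sketch below, is a possible alternative that does not depend on the ordering. The toy label matrix is invented and uses three checks instead of eight.

```python
import numpy as np
import pandas as pd

# Toy label matrix with invented values (three checks instead of eight).
toy_labels = pd.DataFrame({
    "Check1": [0, 1, 0, 1, 0],
    "Check2": [0, 0, 1, 1, 0],
    "Check3": [0, 0, 0, 1, np.nan],
})

# Number of checks fired per record; sum() skips missing labels by default.
n_fired = toy_labels.sum(axis=1)

# Keep only records with at least one detection, then count by number of checks.
multi_counts = n_fired[n_fired > 0].value_counts().sort_index()
print(multi_counts)
```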

In [11]:
#labels correlation
correlation_matrix = data.iloc[:,36:].corr()

fig = plt.figure(figsize=(12,9))

sns.heatmap(correlation_matrix,vmax=0.8,square = True,annot = True)

plt.show()

There is a moderate correlation between Check4 and Check2. All other correlations are weak and positive.
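
Since the labels are binary, the Pearson correlation shown in the heatmap is the phi coefficient of each pair of checks. A co-occurrence table gives a complementary view in absolute counts. The sketch below uses an invented label matrix; on the real data the missing labels would first have to be filled (for example with 0, which is an assumption and not a choice made in this notebook).

```python
import pandas as pd

# Toy label matrix with invented values.
toy_labels = pd.DataFrame({
    "Check2": [1, 1, 0, 0, 1, 0],
    "Check4": [1, 1, 0, 0, 0, 0],
    "Check6": [0, 1, 1, 0, 1, 1],
})

# Entry (i, j) counts the records where both check i and check j fired;
# the diagonal holds the number of detections of each check.
co_occurrence = toy_labels.T.dot(toy_labels)
print(co_occurrence)

# Pearson correlation on 0/1 columns (the phi coefficient), as in the heatmap above.
print(toy_labels.corr().round(2))
```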

3. Feature Analysis

Our problem is to detect the right types of anomaly. In this section, we explore the features from two perspectives: anomalous vs. non-anomalous behavior, and how the data differ across types of anomaly.

Now we build some functions to visualize data statistics:

In [12]:
sns.set()
In [13]:
def logs_per_f(f, graph_type = "bar", figsize=(15,8), data=data):
    '''
    The bar graph describes number of anomaly/normal logs per each value of feature f
    '''
    i = np.where(features == f)[0][0]
    
    # Get the data with f and anomaly info
    tmp_f_data = pd.DataFrame()
    tmp_f_data['Anomaly'] = data.iloc[:,36:44].max(axis=1)
    tmp_f_data[f] = data.iloc[:,i]
    tmp_f_data['tmp'] = data.iloc[:,i]
    
    
    
    # Display the bar graph of anomaly per f
    
    if(graph_type == "bar"):
        tmp_grouped_data = tmp_f_data.pivot_table(index=[f], 
                                              columns=['Anomaly'], 
                                              values='tmp',
                                              fill_value=0, 
                                              aggfunc='count')
        ax = tmp_grouped_data.plot.bar(rot=0,figsize=figsize)
        plt.title("Logs per %s" % f)
        plt.xlabel(f)
    else:
        if(graph_type == "dist"):   
            fig, ax = plt.subplots(figsize=figsize)
            sns.distplot(tmp_f_data.loc[tmp_f_data["Anomaly"] == 0., f], hist=False, rug=True, label="Normal")
            sns.distplot(tmp_f_data.loc[tmp_f_data["Anomaly"] == 1., f], hist=False, rug=True, label="Anomaly")
            plt.title("Logs per %s" % f)
            plt.xlabel(f)
            plt.show()

    
In [14]:
def anomaly_type_per_f(f, graph_type="bar", figsize=(15,8), data=data):
    '''
    The bar graph expresses the number of different types of anomaly per each value of feature f
    The dist graph expresses the distribution of feature f in different types of anomaly
    '''
    i = np.where(features == f)[0][0]
    # Copy the label columns so that adding f does not write into a slice of data
    tmp_f_data = data.iloc[:,36:44].copy()
    tmp_f_data[f] = data.iloc[:,i]
    
    if(graph_type == "bar"):
        tmp_f_data = tmp_f_data.groupby(f).sum()
        ax = tmp_f_data.plot.bar(rot=0,figsize=figsize)
        plt.title("Anomalies types per %s" % f)
        plt.xlabel(f)
    else:
        if(graph_type == "dist"):   
            fig, ax = plt.subplots(figsize=figsize)
            for l in labels:
                sns.distplot(tmp_f_data.loc[tmp_f_data[l] == 1., f], hist=False, rug=True, label=l)
            plt.title("%s Distributions over Anomalies types" % f)
            plt.xlabel(f)
            plt.show()
    
In [15]:
def anomaly_per_f(f, graph_type="bar", top=0, data=data):
    '''
    The bar graph describes number of anomalies through the range value of feature f
    '''
    i = np.where(features == f)[0][0]
    
    tmp_f_data = pd.DataFrame()
    tmp_f_data['Anomaly'] = data.iloc[:,36:44].max(axis=1)
    tmp_f_data[f] = data.iloc[:,i]

    anomaly_per_f= (tmp_f_data.groupby(f).sum())
    
    if (graph_type == "dist"):
        sns.distplot(anomaly_per_f)
#         ax = anomaly_per_f.plot.hist(anomaly_per_f,figsize=(15,8))
        plt.title("Anomalies per %s" % f)
        # Number of systems which have no anomalies
        print("Number of %s which have no anomalies: %s" 
          % (f, anomaly_per_f.loc[anomaly_per_f['Anomaly'] == 0.].count().values))
    else: 
        if (graph_type == "bar"):
            if(top > 0):
                anomaly_per_f = anomaly_per_f.nlargest(top, 'Anomaly')
                title = "Top " + str(top) + " Anomalies per " + f
            else:
                title = "Anomalies per " + f
            ax = anomaly_per_f.plot.bar(rot=0,figsize=(15,8))
            plt.title(title)
    plt.show()
    
In [16]:
def bar_plot(labels, values, title, xlabel, ylabel, size=(15,8)):
    sns.set(font_scale = 1)
    plt.figure(figsize=size)
    ax = sns.barplot(x=labels, y=values)
    ax.xaxis_date()
    plt.title(title, fontsize=18)
    plt.ylabel(ylabel, fontsize=18)
    plt.xlabel(xlabel, fontsize=18)
    #adding the text labels
    rects = ax.patches
    for rect, label in zip(rects, values):
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center', va='bottom', fontsize=18)
    plt.show()

3.1. Anomaly vs Non-anomaly data

After the first part, we have some idea of which features play an important role in anomaly detection, as well as which have no role at all. Now we analyze the features in more depth.

First let's check the distribution of all feature values except Date:

In [17]:
f = pd.melt(data, value_vars=np.delete(features, 2))
g = sns.FacetGrid(f, col="variable",  col_wrap=3, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")

Our problem is to detect anomalies, so features with a uniform distribution may not bring much value.

* Categorical features

As we know, the logs were collected over about one year. We will group Date by month, day and hour to see the role of this field in anomaly detection.

In [18]:
# Keep a copy of the original timestamps
tmp_date = data.iloc[:,2].copy()
In [19]:
# Plot the logs per month, day of month and hour
data['Date'] = tmp_date.dt.month
logs_per_f('Date', "bar", figsize=(15,4))
data['Date'] = tmp_date.dt.day
logs_per_f('Date', "bar", figsize=(15,4))
data['Date'] = tmp_date.dt.hour
logs_per_f('Date', "bar", figsize=(15,4))

From these graphs we observe that the ratio of anomalies to normal behavior per month, day or hour is always around 50%. Some periods have more logs than others (for example October and November, the beginning of the week, or around 4h), but the anomaly rate does not change much.

It seems that Date is not a key feature for anomaly detection. However, at this step we still keep it, using only the hour value instead of the full date.

In [20]:
# Keep only the hour of the day
data['Date'] = tmp_date.dt.hour

del(tmp_date)
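
The cells above overwrite the Date column in place. A possible alternative, sketched below, keeps the original timestamps and stores each component in its own column. The column names `Month`, `DayOfMonth`, `DayOfWeek` and `Hour` are hypothetical. Note that `dt.day` returns the day of the month, while weekday effects would need `dt.dayofweek`.

```python
import pandas as pd

# Three timestamps in the format of the dataset (the last one is invented).
toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["16/01/2017 04:04", "06/02/2017 04:03", "13/02/2017 16:44"],
        format="%d/%m/%Y %H:%M",
    )
})

# Hypothetical helper columns; the original timestamps stay untouched.
toy["Month"] = toy["Date"].dt.month
toy["DayOfMonth"] = toy["Date"].dt.day
toy["DayOfWeek"] = toy["Date"].dt.dayofweek  # Monday = 0
toy["Hour"] = toy["Date"].dt.hour
print(toy)
```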

Check the number of anomaly detections per MergeErrors value:

In [21]:
for f in ["MergeErrors"]:
    logs_per_f(f, "bar", figsize=(15,4))

From this graph we observe that anomalies are more likely to be detected when there are errors in the merge process.

* Numerical features

Now we check the distribution of the features in anomalous and normal logs. SystemID is a categorical feature, but since there are more than 3,000 systems we include it here to make the graphs easier to generate.

In [22]:
numerical_features = np.array([
    "SystemID",
    "HighPriorityAlerts",
    "Dumps",
    "CompositeOOMDums",
    "IndexServerRestarts",
    "NameServerRestarts",
    "XSEngineRestarts",
    "StatisticsServerRestarts",
    "CPU",
    "PhysMEM",
    "InstanceMEM",
    "TablesAllocation",
    "IndexServerAllocationLimit",
    "ColumnUnloads",
    "DeltaSize",
    "BlockingPhaseSec",
    "Disk",
    "LargestTableSize",
    "LargestPartitionSize",
    "DiagnosisFiles",
    "DiagnosisFilesSize",
    "DaysWithSuccessfulDataBackups",
    "DaysWithSuccessfulLogBackups",
    "DaysWithFailedDataBackups",
    "DaysWithFailedfulLogBackups",
    "MinDailyNumberOfSuccessfulDataBackups",
    "MinDailyNumberOfSuccessfulLogBackups",
    "MaxDailyNumberOfFailedDataBackups",
    "MaxDailyNumberOfFailedLogBackups",
    "LogSegmentChange",
])

for f in numerical_features:
    logs_per_f(f, graph_type="dist", figsize=(15,4))

These graphs lead to the following observations:

  • Outliers indicate anomalies: HighPriorityAlerts, Dumps, CompositeOOMDums, IndexServerRestarts, NameServerRestarts, XSEngineRestarts, CPU, PhysMEM, DiagnosisFiles, DiagnosisFilesSize
  • The two distributions have similar shapes, but the anomaly one has higher density: StatisticsServerRestarts, ColumnUnloads, MaxDailyNumberOfFailedDataBackups
  • The two distributions are skewed in opposite directions, indicating that higher values lead to a higher anomaly probability: InstanceMEM, TablesAllocation, IndexServerAllocationLimit, BlockingPhaseSec, Disk, MaxDailyNumberOfFailedDataBackups
  • The two distributions have similar shapes and the normal one has higher density than the anomaly one, but this is reversed at the tail of the distributions: LargestTableSize, LargestPartitionSize
  • The two distributions have similar shapes: DaysWithSuccessfulDataBackups, DaysWithFailedDataBackups, MaxDailyNumberOfFailedLogBackups

This information could be useful when we handle missing data.
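
The visual comparison can be complemented with a numeric one, for example the quartiles of a feature within the normal and the anomalous group. The sketch below uses invented values and assumes a 0/1 column, here called `AnyAnomaly`, that marks records with at least one detection.

```python
import pandas as pd

# Toy data with invented values.
toy = pd.DataFrame({
    "InstanceMEM": [10.0, 20.0, 30.0, 40.0, 60.0, 70.0, 80.0, 90.0],
    "AnyAnomaly":  [0, 0, 0, 0, 1, 1, 1, 1],
})

# Quartiles of the feature within each group; unstack() puts the
# quantile levels into columns, one row per group.
quartiles = (
    toy.groupby("AnyAnomaly")["InstanceMEM"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
print(quartiles)
```

A clear shift of the quartiles between the two rows corresponds to the "skewed in opposite directions" pattern in the plots.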

3.2. Different types of anomaly distribution

* Categorical features

In [23]:
for f in ["Date", "MergeErrors", "HighPriorityAlerts"]:
    anomaly_type_per_f(f, "bar", figsize=(15,4))

* Numerical features

In [24]:
# numerical_features is the same list as defined in In [22]

for f in numerical_features:
    anomaly_type_per_f(f, graph_type="dist", figsize=(15,4))

We see that the order of the different anomaly types does not change much across months.

In [99]:
# def dist_plot(x1, x2):
#     sns.distplot(x1)
#     sns.distplot(x2)
#     plt.legend(loc='upper right')
#     plt.show()
    
# # for f in features2:
# #     print(f, ":")
# #     display(data[f].unique())
# #     data[f].plot()
# #     plt.plot(data[f])
# #     plt.show()

# # display(data["PhysMEM"].unique())
# # display(max(data["PhysMEM"]))
# # f = pd.melt(data.loc[data['Error'] == 1.], value_vars=np.delete(features, 2))
# # f = pd.melt(data, id_vars=['Error'], value_vars=np.delete(features, 2))
# # display(f)
# # g = sns.FacetGrid(f, row="variable", col="Error")
# # g = g.map(dist_plot, "value")
# # plt.show()

# # for f in features:
# #     g = sns.FacetGrid(data, col="Error",  row=f)
# #     g = g.map(plt.hist, "total_bill")
# f = pd.DataFrame()
# f_error = pd.melt(data.loc[data['Error'] == 1.], value_vars=np.delete(features, 2))
# f['feature', 'error'] = f_error['variable', 'value']
# f_normal = pd.melt(data.loc[data['Error'] == 0.], value_vars=np.delete(features, 2))
# f1 = pd.DataFrame()
# f1['feature', 'normal', 'error'] = f 
# g = sns.FacetGrid(f, col="variable", col_wrap=4, sharex=False, sharey=False)
# g = g.map(sns.distplot, "value")

4. Correlation

We check the correlation among features as well as between features and labels.

In [16]:
#correlation
correlation_matrix = data.corr()

fig = plt.figure(figsize=(12,9))

sns.heatmap(correlation_matrix,vmax=0.8,square = True)

plt.show()

We observe strong correlations between:

  • PreprocessorRestarts, DaemonRestarts: this correlation is meaningless because both features are always 0, as we checked in the previous parts.
  • InstanceMEM, TablesAllocation, IndexServerAllocationLimit: let's plot them in pairs.
In [17]:
tmp_f_data = pd.DataFrame()
tmp_f_data['Anomaly'] = data.iloc[:,36:44].max(axis=1)
cols = ['InstanceMEM', 'TablesAllocation', 'IndexServerAllocationLimit']
for c in cols:
    tmp_f_data[c] = data[c]
sns.pairplot(tmp_f_data.dropna(), hue="Anomaly", vars=cols, size = 4.5)
plt.show();
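
The strongly correlated pairs can also be listed numerically instead of being read off the heatmap. The following is a minimal sketch, not part of the original analysis: the helper `top_correlated_pairs`, the 0.8 threshold, and the synthetic `demo` frame are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so that each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic stand-in for the real data (hypothetical values)
rng = np.random.RandomState(0)
base = rng.rand(200)
demo = pd.DataFrame({
    "InstanceMEM": base,
    "TablesAllocation": base + 0.01 * rng.rand(200),  # nearly identical to InstanceMEM
    "CPU": rng.rand(200),                             # independent of the others
})

result = top_correlated_pairs(demo)
print(result)
```

On the real data, the same helper could be called on `data` restricted to its numerical columns.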

Data Pre-processing


The previous step should give you a better understanding of which pre-processing is required for the data. This may include:

  • Normalising and standardising the given data;
  • Removing outliers;
  • Carrying out feature selection;
  • Handling missing information in the dataset;
  • Handling errors in the dataset;
  • Combining existing features.

1. Convert data type

In [25]:
# List of features which should be in type of integer
integer_features = np.array([
    "SessionNumber",
    "SystemID",
    "HighPriorityAlerts",
    "Dumps",
    "CleanupOOMDumps",
    "CompositeOOMDums",
    "IndexServerRestarts",
    "NameServerRestarts",
    "XSEngineRestarts",
    "PreprocessorRestarts",
    "DaemonRestarts",
    "StatisticsServerRestarts",
    "ColumnUnloads",
    "DeltaSize",
    "MergeErrors",
    "BlockingPhaseSec",
    "LargestTableSize",
    "LargestPartitionSize",
    "DiagnosisFiles",
    "DiagnosisFilesSize",
    "DaysWithSuccessfulDataBackups",
    "DaysWithSuccessfulLogBackups",
    "DaysWithFailedDataBackups",
    "DaysWithFailedfulLogBackups",
    "MinDailyNumberOfSuccessfulDataBackups",
    "MinDailyNumberOfSuccessfulLogBackups",
    "MaxDailyNumberOfFailedDataBackups",
    "MaxDailyNumberOfFailedLogBackups",
    "LogSegmentChange",
])

# Cast into integer type
# for f in integer_features:
#     data[f] = data[f].astype('int64') 

1. Handle missing values

Check how many missing values we have in the data

In [26]:
data["Anomaly"] = data.iloc[:,36:44].max(axis=1)
In [27]:
#missing data
total = data.isnull().sum().sort_values(ascending=False)
percent = (data.isnull().sum()/data.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(30)
Out[27]:
Total Percent
BlockingPhaseSec 75854 0.264271
Check4 36647 0.127676
Check7 35722 0.124453
LogSegmentChange 35549 0.123851
Check5 35034 0.122057
IndexServerAllocationLimit 26444 0.092129
CPU 25209 0.087827
Check1 24511 0.085395
Check2 24486 0.085308
Check3 22568 0.078626
InstanceMEM 22117 0.077054
DiagnosisFiles 21923 0.076379
DiagnosisFilesSize 21923 0.076379
PhysMEM 20567 0.071654
LargestTableSize 16250 0.056614
Disk 11379 0.039644
MergeErrors 7733 0.026941
Check6 7384 0.025725
TablesAllocation 2290 0.007978
DeltaSize 206 0.000718
LargestPartitionSize 150 0.000523
Check8 52 0.000181
Anomaly 52 0.000181
Dumps 3 0.000010
CleanupOOMDumps 3 0.000010
CompositeOOMDums 3 0.000010
DaysWithSuccessfulDataBackups 0 0.000000
DaysWithSuccessfulLogBackups 0 0.000000
SystemID 0 0.000000
Date 0 0.000000

We delete all records that carry no label information at all (NaN for all labels).

In [28]:
data = data.dropna(subset=['Anomaly'])

* Categorical features

For MergeErrors, we fill a missing value with 1 if the record is an anomaly (the most frequent value) and with 0 otherwise.

In [29]:
f = "MergeErrors"

data[f] = data[f].fillna(data["Anomaly"])

print(data[f].isnull().sum())
0

* Numerical features

For numerical features, if the record is an anomaly we replace a missing value by the median of the anomaly values, otherwise by the median of the normal values.

In [30]:
nan_features = ['BlockingPhaseSec',
                'LogSegmentChange', 
                'IndexServerAllocationLimit', 
                'CPU',
                'InstanceMEM', 
                'DiagnosisFiles', 
                'DiagnosisFilesSize', 
                'PhysMEM',
                'LargestTableSize', 
                'Disk', 
                'TablesAllocation',
                'DeltaSize', 
                'LargestPartitionSize', 
                'Dumps',
                'CleanupOOMDumps', 
                'CompositeOOMDums']

for f in nan_features:
    data.loc[data["Anomaly"] == 1. , f] = data.loc[data["Anomaly"] == 1. , f].fillna(data.loc[data["Anomaly"] == 1. , f].median())
    data.loc[data["Anomaly"] == 0. , f] = data.loc[data["Anomaly"] == 0. , f].fillna(data.loc[data["Anomaly"] == 0. , f].median())
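
The same class-conditional imputation can be written more compactly with `groupby(...).transform`. This is a minimal sketch on toy data: the helper name `fill_with_group_median` and the values in `toy` are hypothetical, and only the column names `Anomaly` and `CPU` are taken from the notebook.

```python
import numpy as np
import pandas as pd

def fill_with_group_median(frame, columns, group_col):
    """Fill NaNs in `columns` with the median of the row's group in `group_col`."""
    out = frame.copy()
    for col in columns:
        # Median of each group, broadcast back to the original row order
        group_median = out.groupby(group_col)[col].transform("median")
        out[col] = out[col].fillna(group_median)
    return out

# Toy data (hypothetical values): three normal and three anomalous records
toy = pd.DataFrame({
    "Anomaly": [0.0, 0.0, 0.0, 1.0, 1.0, 1.0],
    "CPU":     [1.0, 3.0, np.nan, 80.0, np.nan, 90.0],
})

filled = fill_with_group_median(toy, ["CPU"], "Anomaly")
print(filled)
```

Note that this kind of imputation relies on the label, so it can only be applied to records whose label is already known.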
In [31]:
#missing data
total = data.isnull().sum().sort_values(ascending=False)
percent = (data.isnull().sum()/data.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(10)
Out[31]:
Total Percent
Check4 36595 0.127518
Check7 35670 0.124295
Check5 34982 0.121897
Check1 24459 0.085229
Check2 24434 0.085142
Check3 22516 0.078459
Check6 7332 0.025549
Anomaly 0 0.000000
XSEngineRestarts 0 0.000000
TablesAllocation 0 0.000000
In [33]:
# Fill the remaining missing values (labels of individual checks) with 0
data = data.fillna(0)

2. New features

Now we cast the integer-valued features to their proper data type.

In [34]:
# List of features which should be in type of integer
integer_features = np.array([
    "SessionNumber",
    "SystemID",
    "HighPriorityAlerts",
    "Dumps",
    "CleanupOOMDumps",
    "CompositeOOMDums",
    "IndexServerRestarts",
    "NameServerRestarts",
    "XSEngineRestarts",
    "PreprocessorRestarts",
    "DaemonRestarts",
    "StatisticsServerRestarts",
    "ColumnUnloads",
    "DeltaSize",
    "MergeErrors",
    "BlockingPhaseSec",
    "LargestTableSize",
    "LargestPartitionSize",
    "DiagnosisFiles",
    "DiagnosisFilesSize",
    "DaysWithSuccessfulDataBackups",
    "DaysWithSuccessfulLogBackups",
    "DaysWithFailedDataBackups",
    "DaysWithFailedfulLogBackups",
    "MinDailyNumberOfSuccessfulDataBackups",
    "MinDailyNumberOfSuccessfulLogBackups",
    "MaxDailyNumberOfFailedDataBackups",
    "MaxDailyNumberOfFailedLogBackups",
    "LogSegmentChange",
])

# Cast into integer type
for f in integer_features:
    data[f] = data[f].astype('int64') 

Remove some features:

In [35]:
data = data.drop(['SessionNumber', 'CleanupOOMDumps', 'PreprocessorRestarts', 'DaemonRestarts', 'Anomaly'], axis=1)

Model Selection


In this section, we select suitable models and experiment with them.

Our problem is a multi-label classification problem with a moderate correlation between 2 of the 8 labels. There are two main approaches to such problems:

  • Algorithm adaptation methods: we adapt a specific algorithm so that a single model handles all labels at once.

  • Problem transformation methods: we transform the multi-label problem into multiple single-label problems.

    • Binary relevance
    • Classifier chains
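
The approaches listed above can be sketched with scikit-learn alone. This is an illustrative sketch under stated assumptions, not the setup used in the experiments below: the data is synthetic (`make_multilabel_classification`), `MultiOutputClassifier` stands in for binary relevance, `ClassifierChain` requires a reasonably recent scikit-learn, and all hyperparameters are arbitrary.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import ClassifierChain, MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-label data (hypothetical stand-in for the real dataset)
X, Y = make_multilabel_classification(n_samples=500, n_features=10,
                                      n_classes=4, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)

base = DecisionTreeClassifier(max_depth=5, random_state=0)

models = {
    # Algorithm adaptation: a single tree fitted on the full label matrix
    "single multi-label tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    # Binary relevance: one independent tree per label
    "binary relevance": MultiOutputClassifier(base),
    # Classifier chain: each tree also receives the labels earlier in the chain
    "classifier chain": ClassifierChain(base, random_state=0),
}

scores = {}
for name, clf in models.items():
    clf.fit(X_tr, Y_tr)
    pred = clf.predict(X_te)
    scores[name] = f1_score(Y_te, pred, average="macro")
    print(name, scores[name])
```

A chain can exploit correlations between labels, which binary relevance ignores; whether that helps depends on the data and on the order of the chain.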

For the anomaly detection problem, we choose a decision tree as the core model because of the following advantages:

  • Simple to understand and to interpret. Trees can be visualised.
  • Requires little data preparation. It doesn't require data normalisation or dummy variables creation.
  • Able to handle both numerical and categorical data.
  • Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily explained by boolean logic.
  • Supports multi-label classification.
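
To illustrate the white-box property, the rules learned by a tree can be printed as plain text. This is a minimal sketch on toy data, assuming a scikit-learn version that provides `sklearn.tree.export_text` (0.21 or later, which is newer than the version used in this notebook); the feature values and the "CPU above 90" rule are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.RandomState(0)
# Hypothetical CPU and Disk usage in percent
X = rng.uniform(0, 100, size=(300, 2))
# Toy labelling rule: a record is anomalous when CPU exceeds 90
y = (X[:, 0] > 90).astype(int)

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each printed line is a threshold test on a named feature or a predicted class
rules = export_text(clf, feature_names=["CPU", "Disk"])
print(rules)
```

The printed rules use the feature names directly, which is easier to read than column indices.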
In [72]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.metrics import f1_score
# from sklearn.tree import 

1. Single multi-label classifier

1.1. Data split

In [64]:
df = data.dropna(subset=labels)

y = df[labels].values  # label matrix as a NumPy array
df = df.drop(labels, axis=1)
# df = df.drop(['Anomaly'], axis = 1)

display(df[0:5])
display(y[0:5])
SystemID Date HighPriorityAlerts Dumps CompositeOOMDums IndexServerRestarts NameServerRestarts XSEngineRestarts StatisticsServerRestarts CPU PhysMEM InstanceMEM TablesAllocation IndexServerAllocationLimit ColumnUnloads DeltaSize MergeErrors BlockingPhaseSec Disk LargestTableSize LargestPartitionSize DiagnosisFiles DiagnosisFilesSize DaysWithSuccessfulDataBackups DaysWithSuccessfulLogBackups DaysWithFailedDataBackups DaysWithFailedfulLogBackups MinDailyNumberOfSuccessfulDataBackups MinDailyNumberOfSuccessfulLogBackups MaxDailyNumberOfFailedDataBackups MaxDailyNumberOfFailedLogBackups LogSegmentChange
0 0 4 1 0 0 0 0 0 0 4.77 61.86 37.48 0.0 39.97 0 52884993 0 10 65.69 606600 6804 79 444366335 7 8 0 0 1 32 0 0 0
1 1 4 0 0 0 0 0 0 0 1.05 32.82 12.77 0.0 39.97 0 65546255 0 10 45.60 1818555 6804 54 227400051 3 8 0 0 1 32 0 0 0
2 1 4 0 0 0 0 0 0 0 0.66 35.16 13.00 0.0 39.97 0 59582212 0 10 18.94 1818505 6804 54 234913753 3 8 0 0 1 32 0 0 0
3 2 4 1 0 0 0 0 0 0 3.17 82.93 52.94 0.0 39.97 0 48229160 0 10 40.29 695934 6804 91 511053878 7 8 0 0 1 38 0 0 0
4 3 4 1 0 0 0 0 0 0 2.92 76.18 20.51 0.0 39.97 0 79452443 0 10 49.83 959031 6804 55 172953445 7 8 0 0 1 5 0 0 0
array([[0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.]])
In [65]:
X_train, X_test, y_train, y_test = train_test_split(df, y, train_size=0.8, test_size=0.2)

1.2. Model Training

In [66]:
model = DecisionTreeClassifier(max_depth=8, max_features=15)
model.fit(X_train, y_train)

# tree.plot_tree(model.fit(df1, y1)) 
Out[66]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=8,
            max_features=15, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

1.3. Model Evaluation

In [68]:
export_graphviz(model) 

print("score: ", model.score(X_test, y_test))

y_pred = model.predict(X_test)
f1 = f1_score(y_test, y_pred, average='macro')  

print("f1 score: ", f1)
score:  0.963185587845843
f1 score:  0.8312988290041592
/usr/local/lib/python3.5/dist-packages/sklearn/tree/export.py:399: DeprecationWarning: out_file can be set to None starting from 0.18. This will be the default in 0.20.
  DeprecationWarning)
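
The two numbers above measure different things: for multi-label targets, `score` is the subset accuracy (a record counts as correct only if all of its labels are predicted correctly), whereas the macro F1 score averages the F1 scores of the individual labels. A small worked example with made-up labels, not taken from the dataset:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([[0, 0, 1],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 0, 0]])
y_pred = np.array([[0, 0, 1],
                   [0, 1, 1],   # one extra label: the whole row counts as wrong
                   [1, 0, 0],
                   [0, 0, 0]])

# Subset accuracy: 3 of 4 rows match exactly
subset_acc = accuracy_score(y_true, y_pred)
# One F1 value per label; only the third label has a false positive
per_label_f1 = f1_score(y_true, y_pred, average=None)
# Macro F1: unweighted mean of the per-label values
macro_f1 = f1_score(y_true, y_pred, average="macro")

print("subset accuracy:", subset_acc)
print("per-label F1:", per_label_f1)
print("macro F1:", macro_f1)
```

Here the subset accuracy is 0.75, the per-label F1 scores are 1, 1 and 2/3, and the macro F1 is 8/9. Looking at the per-label values (`average=None`) shows which check is hardest to predict.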

2. Binary relevance

In [69]:
y = data[labels]
df = data.drop(labels, axis=1)
In [70]:
X_train, X_test, y_train, y_test = train_test_split(df, y, train_size=0.8, test_size=0.2)
In [74]:
import sklearn.metrics

# the data is assumed to be loaded
# and available in X_train/X_test, y_train/y_test

# initialize the Binary Relevance multi-label classifier
# with a decision tree as base classifier
classifier = BinaryRelevance(DecisionTreeClassifier(max_depth=8, max_features=15))

# train
classifier.fit(X_train, y_train)

# predict
predictions = classifier.predict(X_test)
display(predictions[:10].todense())
# measure
print(sklearn.metrics.f1_score(y_test, predictions, average='macro'))
matrix([[0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.]])
0.9225557659810034

3. Classifier chains

In [41]:
data1 = data
# data1 = data1.drop(['Date','SessionNumber'], axis=1)
data1 = data1.fillna(0)
data1[labels] = data1[labels].fillna(0)

# data1 = data1.dropna()
# print("after drop: ", data1.shape)
# display(data1[28:30])
y1 = data1[labels].values  # label matrix as a NumPy array
# df1 = data1.drop("Date", axis=1)
df1 = data1.drop(labels, axis=1)

display(df1[28:30])
display(y1[20:40])
SessionNumber SystemID Date HighPriorityAlerts Dumps CleanupOOMDumps CompositeOOMDums IndexServerRestarts NameServerRestarts XSEngineRestarts PreprocessorRestarts DaemonRestarts StatisticsServerRestarts CPU PhysMEM InstanceMEM TablesAllocation IndexServerAllocationLimit ColumnUnloads DeltaSize MergeErrors BlockingPhaseSec Disk LargestTableSize LargestPartitionSize DiagnosisFiles DiagnosisFilesSize DaysWithSuccessfulDataBackups DaysWithSuccessfulLogBackups DaysWithFailedDataBackups DaysWithFailedfulLogBackups MinDailyNumberOfSuccessfulDataBackups MinDailyNumberOfSuccessfulLogBackups MaxDailyNumberOfFailedDataBackups MaxDailyNumberOfFailedLogBackups LogSegmentChange
28 28 10 1 1 0.0 0.0 0.0 0 0 0 0 0 0 7.40 26.63 59.16 0.0 0.0 0 64989788.0 0.0 0.0 55.99 1378393.0 6804.0 156.0 345010186.0 7 8 0 0 1 10 0 0 0.0
29 29 10 2 2 0.0 0.0 0.0 0 0 0 0 0 0 16.08 44.15 21.03 0.0 0.0 0 64646209.0 0.0 0.0 55.99 1450712.0 6804.0 121.0 341814153.0 7 8 0 0 1 10 0 0 0.0
array([[0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.]])
In [99]:
display(predictions[:100].todense())
matrix([[0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 1.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 1., 0., 1., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 1., 0., 1., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 1., 0., 1., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 1., 0., 1., 0., 0., 0., 0.],
        [0., 0., 1., 1., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 1., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 1.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 1., 1., 1., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 1., 0., 1., 0., 1., 0., 1.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 1.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 1., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0.]])
In [50]:
from sklearn.metrics import f1_score
from sklearn import tree

tree.export_graphviz(model) 

print("score: ", model.score(X_test, y_test))

y_pred = model.predict(X_test)
f1 = f1_score(y_test, y_pred, average='macro')  

print("f1 score: ", f1)

display(y_pred[:10])
display(y_test[:10])
/usr/local/lib/python3.5/dist-packages/sklearn/tree/export.py:399: DeprecationWarning: out_file can be set to None starting from 0.18. This will be the default in 0.20.
  DeprecationWarning)
score:  0.9560332363649032
f1 score:  0.8990639861898473
array([[0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0.]])
array([[0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0.]])
In [102]:
from sklearn.tree import export_graphviz

export_graphviz(model)
In [51]:
def tree_analysis(estimator):
    n_nodes = estimator.tree_.node_count
    children_left = estimator.tree_.children_left
    children_right = estimator.tree_.children_right
    feature = estimator.tree_.feature
    threshold = estimator.tree_.threshold

#     print("The binary tree structure has %s nodes, %s children left, %s children right %s feature, %s threshold"
#           % (n_nodes, children_left.size, children_right.size, feature.size, threshold))
    # The tree structure can be traversed to compute various properties such
    # as the depth of each node and whether or not it is a leaf.
    node_depth = np.zeros(shape=n_nodes, dtype=np.int64)
    is_leaves = np.zeros(shape=n_nodes, dtype=bool)
    stack = [(0, -1)]  # seed is the root node id and its parent depth
    while len(stack) > 0:
        node_id, parent_depth = stack.pop()
        node_depth[node_id] = parent_depth + 1

        # If we have a test node
        if (children_left[node_id] != children_right[node_id]):
            stack.append((children_left[node_id], parent_depth + 1))
            stack.append((children_right[node_id], parent_depth + 1))
        else:
            is_leaves[node_id] = True

    print("The binary tree structure has %s nodes and has "
          "the following tree structure:"
          % n_nodes)
    for i in range(n_nodes):
        if is_leaves[i]:
            print("%snode=%s leaf node." % (node_depth[i] * "\t", i))
        else:
            print("%snode=%s test node: go to node %s if X[:, %s] <= %s else to "
                  "node %s."
                  % (node_depth[i] * "\t",
                     i,
                     children_left[i],
                     feature[i],
                     threshold[i],
                     children_right[i],
                     ))
    print()

    # First let's retrieve the decision path of each sample. The decision_path
    # method allows to retrieve the node indicator functions. A non zero element of
    # indicator matrix at the position (i, j) indicates that the sample i goes
    # through the node j.

#     node_indicator = estimator.decision_path(X_test)

#     # Similarly, we can also have the leaves ids reached by each sample.

#     leave_id = estimator.apply(X_test)

#     # Now, it's possible to get the tests that were used to predict a sample or
#     # a group of samples. First, let's make it for the sample.

#     sample_id = 0
#     node_index = node_indicator.indices[node_indicator.indptr[sample_id]:
#                                         node_indicator.indptr[sample_id + 1]]

#     print('Rules used to predict sample %s: ' % sample_id)
#     for node_id in node_index:
#         if leave_id[sample_id] == node_id:
#             continue

#         if (X_test[sample_id, feature[node_id]] <= threshold[node_id]):
#             threshold_sign = "<="
#         else:
#             threshold_sign = ">"

#         print("decision id node %s : (X_test[%s, %s] (= %s) %s %s)"
#               % (node_id,
#                  sample_id,
#                  feature[node_id],
#                  X_test[sample_id, feature[node_id]],
#                  threshold_sign,
#                  threshold[node_id]))

#     # For a group of samples, we have the following common node.
#     sample_ids = [0, 1]
#     common_nodes = (node_indicator.toarray()[sample_ids].sum(axis=0) ==
#                     len(sample_ids))

#     common_node_id = np.arange(n_nodes)[common_nodes]

#     print("\nThe following samples %s share the node %s in the tree"
#           % (sample_ids, common_node_id))
#     print("It is %s %% of all nodes." % (100 * len(common_node_id) / n_nodes,))
In [52]:
tree_analysis(model)
The binary tree structure has 1195 nodes and has the following tree structure:
node=0 test node: go to node 1 if X[:, 17] <= 90.0050048828125 else to node 580.
	node=1 test node: go to node 2 if X[:, 4] <= 0.5 else to node 339.
		node=2 test node: go to node 3 if X[:, 13] <= 89.99500274658203 else to node 228.
			node=3 test node: go to node 4 if X[:, 3] <= 3.5 else to node 109.
				node=4 test node: go to node 5 if X[:, 25] <= 150.5 else to node 66.
					node=5 test node: go to node 6 if X[:, 35] <= 8.5 else to node 35.
						node=6 test node: go to node 7 if X[:, 25] <= 1.5 else to node 20.
							node=7 test node: go to node 8 if X[:, 28] <= 5.5 else to node 13.
								node=8 test node: go to node 9 if X[:, 7] <= 0.5 else to node 12.
									node=9 test node: go to node 10 if X[:, 34] <= 3461.5 else to node 11.
										node=10 leaf node.
										node=11 leaf node.
									node=12 leaf node.
								node=13 test node: go to node 14 if X[:, 17] <= 0.3149999976158142 else to node 17.
									node=14 test node: go to node 15 if X[:, 24] <= 46460272.0 else to node 16.
										node=15 leaf node.
										node=16 leaf node.
									node=17 test node: go to node 18 if X[:, 24] <= 196490192.0 else to node 19.
										node=18 leaf node.
										node=19 leaf node.
							node=20 test node: go to node 21 if X[:, 14] <= 94.9949951171875 else to node 28.
								node=21 test node: go to node 22 if X[:, 26] <= 6424178688.0 else to node 25.
									node=22 test node: go to node 23 if X[:, 7] <= 2.5 else to node 24.
										node=23 leaf node.
										node=24 leaf node.
									node=25 test node: go to node 26 if X[:, 19] <= 18227441664.0 else to node 27.
										node=26 leaf node.
										node=27 leaf node.
								node=28 test node: go to node 29 if X[:, 22] <= 4.625 else to node 32.
									node=29 test node: go to node 30 if X[:, 17] <= 34.81999969482422 else to node 31.
										node=30 leaf node.
										node=31 leaf node.
									node=32 test node: go to node 33 if X[:, 35] <= -9.5 else to node 34.
										node=33 leaf node.
										node=34 leaf node.
						node=35 test node: go to node 36 if X[:, 21] <= 3.5 else to node 51.
							node=36 test node: go to node 37 if X[:, 35] <= 114.0 else to node 44.
								node=37 test node: go to node 38 if X[:, 23] <= 1089742.5 else to node 41.
									node=38 test node: go to node 39 if X[:, 19] <= 68695472.0 else to node 40.
										node=39 leaf node.
										node=40 leaf node.
									node=41 test node: go to node 42 if X[:, 30] <= 2.5 else to node 43.
										node=42 leaf node.
										node=43 leaf node.
								node=44 test node: go to node 45 if X[:, 13] <= 1.0450000762939453 else to node 48.
									node=45 test node: go to node 46 if X[:, 16] <= 1.4900000095367432 else to node 47.
										node=46 leaf node.
										node=47 leaf node.
									node=48 test node: go to node 49 if X[:, 13] <= 39.58000183105469 else to node 50.
										node=49 leaf node.
										node=50 leaf node.
							node=51 test node: go to node 52 if X[:, 16] <= 10.229999542236328 else to node 59.
								node=52 test node: go to node 53 if X[:, 26] <= 1568493056.0 else to node 56.
									node=53 test node: go to node 54 if X[:, 34] <= 0.5 else to node 55.
										node=54 leaf node.
										node=55 leaf node.
									node=56 test node: go to node 57 if X[:, 16] <= 7.904999732971191 else to node 58.
										node=57 leaf node.
										node=58 leaf node.
								node=59 test node: go to node 60 if X[:, 35] <= 66.5 else to node 63.
									node=60 test node: go to node 61 if X[:, 32] <= 2.5 else to node 62.
										node=61 leaf node.
										node=62 leaf node.
									node=63 test node: go to node 64 if X[:, 19] <= 15654749184.0 else to node 65.
										node=64 leaf node.
										node=65 leaf node.
					node=66 test node: go to node 67 if X[:, 14] <= 94.9949951171875 else to node 96.
						node=67 test node: go to node 68 if X[:, 35] <= 9.5 else to node 81.
							node=68 test node: go to node 69 if X[:, 15] <= 90.01499938964844 else to node 76.
								node=69 test node: go to node 70 if X[:, 7] <= 1.5 else to node 73.
									node=70 test node: go to node 71 if X[:, 8] <= 2.5 else to node 72.
										node=71 leaf node.
										node=72 leaf node.
									node=73 test node: go to node 74 if X[:, 13] <= 8.0 else to node 75.
										node=74 leaf node.
										node=75 leaf node.
								node=76 test node: go to node 77 if X[:, 16] <= 6.385000228881836 else to node 78.
									node=77 leaf node.
									node=78 test node: go to node 79 if X[:, 7] <= 1.0 else to node 80.
										node=79 leaf node.
										node=80 leaf node.
							node=81 test node: go to node 82 if X[:, 19] <= 595204736.0 else to node 89.
								node=82 test node: go to node 83 if X[:, 15] <= 37.32499694824219 else to node 86.
									node=83 test node: go to node 84 if X[:, 35] <= 18.5 else to node 85.
										node=84 leaf node.
										node=85 leaf node.
									node=86 test node: go to node 87 if X[:, 14] <= 53.43499755859375 else to node 88.
										node=87 leaf node.
										node=88 leaf node.
								node=89 test node: go to node 90 if X[:, 13] <= 2.869999885559082 else to node 93.
									node=90 test node: go to node 91 if X[:, 35] <= 89.0 else to node 92.
										node=91 leaf node.
										node=92 leaf node.
									node=93 test node: go to node 94 if X[:, 35] <= 205.0 else to node 95.
										node=94 leaf node.
										node=95 leaf node.
						node=96 test node: go to node 97 if X[:, 24] <= 65867.0 else to node 100.
							node=97 test node: go to node 98 if X[:, 8] <= 0.5 else to node 99.
								node=98 leaf node.
								node=99 leaf node.
							node=100 test node: go to node 101 if X[:, 30] <= 5.5 else to node 106.
								node=101 test node: go to node 102 if X[:, 34] <= 13.0 else to node 103.
									node=102 leaf node.
									node=103 test node: go to node 104 if X[:, 13] <= 1.8550000190734863 else to node 105.
										node=104 leaf node.
										node=105 leaf node.
								node=106 test node: go to node 107 if X[:, 32] <= 115.0 else to node 108.
									node=107 leaf node.
									node=108 leaf node.
				node=109 test node: go to node 110 if X[:, 35] <= 8.5 else to node 167.
					node=110 test node: go to node 111 if X[:, 7] <= 2.5 else to node 142.
						node=111 test node: go to node 112 if X[:, 35] <= -0.5 else to node 127.
							node=112 test node: go to node 113 if X[:, 25] <= 150.5 else to node 120.
								node=113 test node: go to node 114 if X[:, 26] <= 1001023.0 else to node 117.
									node=114 test node: go to node 115 if X[:, 16] <= 35.24500274658203 else to node 116.
										node=115 leaf node.
										node=116 leaf node.
									node=117 test node: go to node 118 if X[:, 21] <= 1561.0 else to node 119.
										node=118 leaf node.
										node=119 leaf node.
								node=120 test node: go to node 121 if X[:, 19] <= 49380184064.0 else to node 124.
									node=121 test node: go to node 122 if X[:, 14] <= 94.95500183105469 else to node 123.
										node=122 leaf node.
										node=123 leaf node.
									node=124 test node: go to node 125 if X[:, 14] <= 95.04000091552734 else to node 126.
										node=125 leaf node.
										node=126 leaf node.
							node=127 test node: go to node 128 if X[:, 25] <= 150.5 else to node 135.
								node=128 test node: go to node 129 if X[:, 14] <= 94.9949951171875 else to node 132.
									node=129 test node: go to node 130 if X[:, 25] <= 2.0 else to node 131.
										node=130 leaf node.
										node=131 leaf node.
									node=132 test node: go to node 133 if X[:, 32] <= 163.5 else to node 134.
										node=133 leaf node.
										node=134 leaf node.
								node=135 test node: go to node 136 if X[:, 14] <= 95.00999450683594 else to node 139.
									node=136 test node: go to node 137 if X[:, 8] <= 1.5 else to node 138.
										node=137 leaf node.
										node=138 leaf node.
									node=139 test node: go to node 140 if X[:, 15] <= 89.0050048828125 else to node 141.
										node=140 leaf node.
										node=141 leaf node.
						node=142 test node: go to node 143 if X[:, 24] <= 34562596.0 else to node 152.
							node=143 test node: go to node 144 if X[:, 14] <= 94.5199966430664 else to node 149.
								node=144 test node: go to node 145 if X[:, 25] <= 151.0 else to node 148.
									node=145 test node: go to node 146 if X[:, 14] <= 31.75 else to node 147.
										node=146 leaf node.
										node=147 leaf node.
									node=148 leaf node.
								node=149 test node: go to node 150 if X[:, 30] <= 4.0 else to node 151.
									node=150 leaf node.
									node=151 leaf node.
							node=152 test node: go to node 153 if X[:, 25] <= 151.0 else to node 160.
								node=153 test node: go to node 154 if X[:, 14] <= 95.13999938964844 else to node 157.
									node=154 test node: go to node 155 if X[:, 32] <= 2.0 else to node 156.
										node=155 leaf node.
										node=156 leaf node.
									node=157 test node: go to node 158 if X[:, 26] <= 35659528.0 else to node 159.
										node=158 leaf node.
										node=159 leaf node.
								node=160 test node: go to node 161 if X[:, 14] <= 95.0 else to node 164.
									node=161 test node: go to node 162 if X[:, 7] <= 26.0 else to node 163.
										node=162 leaf node.
										node=163 leaf node.
									node=164 test node: go to node 165 if X[:, 15] <= 81.51499938964844 else to node 166.
										node=165 leaf node.
										node=166 leaf node.
					node=167 test node: go to node 168 if X[:, 26] <= 923063424.0 else to node 197.
						node=168 test node: go to node 169 if X[:, 25] <= 150.5 else to node 184.
							node=169 test node: go to node 170 if X[:, 26] <= 904624.5 else to node 177.
								node=170 test node: go to node 171 if X[:, 35] <= 39.5 else to node 174.
									node=171 test node: go to node 172 if X[:, 22] <= 10.925000190734863 else to node 173.
										node=172 leaf node.
										node=173 leaf node.
									node=174 test node: go to node 175 if X[:, 22] <= 75.05999755859375 else to node 176.
										node=175 leaf node.
										node=176 leaf node.
								node=177 test node: go to node 178 if X[:, 7] <= 2.5 else to node 181.
									node=178 test node: go to node 179 if X[:, 27] <= 0.5 else to node 180.
										node=179 leaf node.
										node=180 leaf node.
									node=181 test node: go to node 182 if X[:, 32] <= 69.5 else to node 183.
										node=182 leaf node.
										node=183 leaf node.
							node=184 test node: go to node 185 if X[:, 14] <= 94.73500061035156 else to node 192.
								node=185 test node: go to node 186 if X[:, 28] <= 5.0 else to node 189.
									node=186 test node: go to node 187 if X[:, 14] <= 49.084999084472656 else to node 188.
										node=187 leaf node.
										node=188 leaf node.
									node=189 test node: go to node 190 if X[:, 30] <= 4.5 else to node 191.
										node=190 leaf node.
										node=191 leaf node.
								node=192 test node: go to node 193 if X[:, 26] <= 767797760.0 else to node 194.
									node=193 leaf node.
									node=194 test node: go to node 195 if X[:, 7] <= 1.0 else to node 196.
										node=195 leaf node.
										node=196 leaf node.
						node=197 test node: go to node 198 if X[:, 26] <= 2210244096.0 else to node 213.
							node=198 test node: go to node 199 if X[:, 24] <= 49664848.0 else to node 206.
								node=199 test node: go to node 200 if X[:, 23] <= 31871384.0 else to node 203.
									node=200 test node: go to node 201 if X[:, 15] <= 33.994998931884766 else to node 202.
										node=201 leaf node.
										node=202 leaf node.
									node=203 test node: go to node 204 if X[:, 24] <= 36427432.0 else to node 205.
										node=204 leaf node.
										node=205 leaf node.
								node=206 test node: go to node 207 if X[:, 25] <= 150.5 else to node 210.
									node=207 test node: go to node 208 if X[:, 35] <= 66.5 else to node 209.
										node=208 leaf node.
										node=209 leaf node.
									node=210 test node: go to node 211 if X[:, 35] <= 51.5 else to node 212.
										node=211 leaf node.
										node=212 leaf node.
							node=213 test node: go to node 214 if X[:, 14] <= 94.94999694824219 else to node 221.
								node=214 test node: go to node 215 if X[:, 32] <= 1.5 else to node 218.
									node=215 test node: go to node 216 if X[:, 25] <= 149.5 else to node 217.
										node=216 leaf node.
										node=217 leaf node.
									node=218 test node: go to node 219 if X[:, 25] <= 150.0 else to node 220.
										node=219 leaf node.
										node=220 leaf node.
								node=221 test node: go to node 222 if X[:, 17] <= 52.349998474121094 else to node 225.
									node=222 test node: go to node 223 if X[:, 32] <= 345.0 else to node 224.
										node=223 leaf node.
										node=224 leaf node.
									node=225 test node: go to node 226 if X[:, 22] <= 37.285003662109375 else to node 227.
										node=226 leaf node.
										node=227 leaf node.
			node=228 test node: go to node 229 if X[:, 26] <= 558194112.0 else to node 280.
				node=229 test node: go to node 230 if X[:, 26] <= 22228936.0 else to node 247.
					node=230 test node: go to node 231 if X[:, 1] <= 67.5 else to node 236.
						node=231 test node: go to node 232 if X[:, 8] <= 0.5 else to node 235.
							node=232 test node: go to node 233 if X[:, 32] <= 109.5 else to node 234.
								node=233 leaf node.
								node=234 leaf node.
							node=235 leaf node.
						node=236 test node: go to node 237 if X[:, 32] <= 225.5 else to node 244.
							node=237 test node: go to node 238 if X[:, 32] <= 197.0 else to node 243.
								node=238 test node: go to node 239 if X[:, 15] <= 20.420000076293945 else to node 240.
									node=239 leaf node.
									node=240 test node: go to node 241 if X[:, 17] <= 78.82499694824219 else to node 242.
										node=241 leaf node.
										node=242 leaf node.
								node=243 leaf node.
							node=244 test node: go to node 245 if X[:, 34] <= 37.0 else to node 246.
								node=245 leaf node.
								node=246 leaf node.
					node=247 test node: go to node 248 if X[:, 35] <= 0.5 else to node 261.
						node=248 test node: go to node 249 if X[:, 18] <= 102386.0 else to node 256.
							node=249 test node: go to node 250 if X[:, 25] <= 140.0 else to node 255.
								node=250 test node: go to node 251 if X[:, 7] <= 3.5 else to node 254.
									node=251 test node: go to node 252 if X[:, 16] <= 65.66499328613281 else to node 253.
										node=252 leaf node.
										node=253 leaf node.
									node=254 leaf node.
								node=255 leaf node.
							node=256 test node: go to node 257 if X[:, 28] <= 6.0 else to node 258.
								node=257 leaf node.
								node=258 test node: go to node 259 if X[:, 17] <= 41.994998931884766 else to node 260.
									node=259 leaf node.
									node=260 leaf node.
						node=261 test node: go to node 262 if X[:, 14] <= 95.43000030517578 else to node 277.
							node=262 test node: go to node 263 if X[:, 25] <= 60.5 else to node 270.
								node=263 test node: go to node 264 if X[:, 14] <= 90.97000122070312 else to node 267.
									node=264 test node: go to node 265 if X[:, 29] <= 1.0 else to node 266.
										node=265 leaf node.
										node=266 leaf node.
									node=267 test node: go to node 268 if X[:, 13] <= 99.55000305175781 else to node 269.
										node=268 leaf node.
										node=269 leaf node.
								node=270 test node: go to node 271 if X[:, 0] <= 189999.0 else to node 274.
									node=271 test node: go to node 272 if X[:, 34] <= 1.5 else to node 273.
										node=272 leaf node.
										node=273 leaf node.
									node=274 test node: go to node 275 if X[:, 13] <= 99.10499572753906 else to node 276.
										node=275 leaf node.
										node=276 leaf node.
							node=277 test node: go to node 278 if X[:, 28] <= 7.5 else to node 279.
								node=278 leaf node.
								node=279 leaf node.
				node=280 test node: go to node 281 if X[:, 26] <= 1587760896.0 else to node 310.
					node=281 test node: go to node 282 if X[:, 14] <= 95.01499938964844 else to node 305.
						node=282 test node: go to node 283 if X[:, 14] <= 79.35499572753906 else to node 292.
							node=283 test node: go to node 284 if X[:, 25] <= 151.0 else to node 291.
								node=284 test node: go to node 285 if X[:, 35] <= 13.5 else to node 288.
									node=285 test node: go to node 286 if X[:, 18] <= 133748.5 else to node 287.
										node=286 leaf node.
										node=287 leaf node.
									node=288 test node: go to node 289 if X[:, 23] <= 35905688.0 else to node 290.
										node=289 leaf node.
										node=290 leaf node.
								node=291 leaf node.
							node=292 test node: go to node 293 if X[:, 25] <= 151.0 else to node 298.
								node=293 test node: go to node 294 if X[:, 35] <= 74.0 else to node 297.
									node=294 test node: go to node 295 if X[:, 8] <= 1.5 else to node 296.
										node=295 leaf node.
										node=296 leaf node.
									node=297 leaf node.
								node=298 test node: go to node 299 if X[:, 32] <= 0.5 else to node 302.
									node=299 test node: go to node 300 if X[:, 19] <= 13461266432.0 else to node 301.
										node=300 leaf node.
										node=301 leaf node.
									node=302 test node: go to node 303 if X[:, 35] <= 7.0 else to node 304.
										node=303 leaf node.
										node=304 leaf node.
						node=305 test node: go to node 306 if X[:, 19] <= 1918565504.0 else to node 307.
							node=306 leaf node.
							node=307 test node: go to node 308 if X[:, 25] <= 144.0 else to node 309.
								node=308 leaf node.
								node=309 leaf node.
					node=310 test node: go to node 311 if X[:, 25] <= 148.5 else to node 322.
						node=311 test node: go to node 312 if X[:, 25] <= 63.5 else to node 313.
							node=312 leaf node.
							node=313 test node: go to node 314 if X[:, 35] <= 10.0 else to node 319.
								node=314 test node: go to node 315 if X[:, 16] <= 63.814998626708984 else to node 318.
									node=315 test node: go to node 316 if X[:, 35] <= -11.0 else to node 317.
										node=316 leaf node.
										node=317 leaf node.
									node=318 leaf node.
								node=319 test node: go to node 320 if X[:, 34] <= 1052.0 else to node 321.
									node=320 leaf node.
									node=321 leaf node.
						node=322 test node: go to node 323 if X[:, 17] <= 78.83999633789062 else to node 330.
							node=323 test node: go to node 324 if X[:, 1] <= 1586.0 else to node 327.
								node=324 test node: go to node 325 if X[:, 14] <= 94.91500091552734 else to node 326.
									node=325 leaf node.
									node=326 leaf node.
								node=327 test node: go to node 328 if X[:, 14] <= 72.08500671386719 else to node 329.
									node=328 leaf node.
									node=329 leaf node.
							node=330 test node: go to node 331 if X[:, 34] <= 3.5 else to node 334.
								node=331 test node: go to node 332 if X[:, 14] <= 94.66999816894531 else to node 333.
									node=332 leaf node.
									node=333 leaf node.
								node=334 test node: go to node 335 if X[:, 34] <= 23.0 else to node 336.
									node=335 leaf node.
									node=336 test node: go to node 337 if X[:, 20] <= 0.5 else to node 338.
										node=337 leaf node.
										node=338 leaf node.
		node=339 test node: go to node 340 if X[:, 7] <= 2.5 else to node 485.
			node=340 test node: go to node 341 if X[:, 25] <= 150.5 else to node 436.
				node=341 test node: go to node 342 if X[:, 14] <= 95.0050048828125 else to node 391.
					node=342 test node: go to node 343 if X[:, 26] <= 7309199.5 else to node 368.
						node=343 test node: go to node 344 if X[:, 35] <= 56.0 else to node 359.
							node=344 test node: go to node 345 if X[:, 4] <= 3.5 else to node 352.
								node=345 test node: go to node 346 if X[:, 1] <= 1503.5 else to node 349.
									node=346 test node: go to node 347 if X[:, 1] <= 1488.5 else to node 348.
										node=347 leaf node.
										node=348 leaf node.
									node=349 test node: go to node 350 if X[:, 21] <= 18387.0 else to node 351.
										node=350 leaf node.
										node=351 leaf node.
								node=352 test node: go to node 353 if X[:, 1] <= 163.5 else to node 356.
									node=353 test node: go to node 354 if X[:, 15] <= 80.33000183105469 else to node 355.
										node=354 leaf node.
										node=355 leaf node.
									node=356 test node: go to node 357 if X[:, 19] <= 24401816.0 else to node 358.
										node=357 leaf node.
										node=358 leaf node.
							node=359 test node: go to node 360 if X[:, 32] <= 37.0 else to node 363.
								node=360 test node: go to node 361 if X[:, 1] <= 1108.5 else to node 362.
									node=361 leaf node.
									node=362 leaf node.
								node=363 test node: go to node 364 if X[:, 23] <= 1351751936.0 else to node 367.
									node=364 test node: go to node 365 if X[:, 13] <= 4.904999732971191 else to node 366.
										node=365 leaf node.
										node=366 leaf node.
									node=367 leaf node.
						node=368 test node: go to node 369 if X[:, 15] <= 89.95500183105469 else to node 382.
							node=369 test node: go to node 370 if X[:, 35] <= 8.5 else to node 375.
								node=370 test node: go to node 371 if X[:, 16] <= 70.20500183105469 else to node 374.
									node=371 test node: go to node 372 if X[:, 9] <= 2.5 else to node 373.
										node=372 leaf node.
										node=373 leaf node.
									node=374 leaf node.
								node=375 test node: go to node 376 if X[:, 30] <= 6.5 else to node 379.
									node=376 test node: go to node 377 if X[:, 24] <= 374512864.0 else to node 378.
										node=377 leaf node.
										node=378 leaf node.
									node=379 test node: go to node 380 if X[:, 32] <= 98.5 else to node 381.
										node=380 leaf node.
										node=381 leaf node.
							node=382 test node: go to node 383 if X[:, 32] <= 13.5 else to node 386.
								node=383 test node: go to node 384 if X[:, 1] <= 1096.0 else to node 385.
									node=384 leaf node.
									node=385 leaf node.
								node=386 test node: go to node 387 if X[:, 13] <= 71.55500030517578 else to node 390.
									node=387 test node: go to node 388 if X[:, 24] <= 1149416192.0 else to node 389.
										node=388 leaf node.
										node=389 leaf node.
									node=390 leaf node.
					node=391 test node: go to node 392 if X[:, 25] <= 8.5 else to node 419.
						node=392 test node: go to node 393 if X[:, 17] <= 7.099999904632568 else to node 406.
							node=393 test node: go to node 394 if X[:, 24] <= 1011646016.0 else to node 399.
								node=394 test node: go to node 395 if X[:, 32] <= 126.0 else to node 396.
									node=395 leaf node.
									node=396 test node: go to node 397 if X[:, 24] <= 616725248.0 else to node 398.
										node=397 leaf node.
										node=398 leaf node.
								node=399 test node: go to node 400 if X[:, 22] <= 68.68499755859375 else to node 403.
									node=400 test node: go to node 401 if X[:, 3] <= 15.5 else to node 402.
										node=401 leaf node.
										node=402 leaf node.
									node=403 test node: go to node 404 if X[:, 18] <= 170202.0 else to node 405.
										node=404 leaf node.
										node=405 leaf node.
							node=406 test node: go to node 407 if X[:, 35] <= -255.5 else to node 412.
								node=407 test node: go to node 408 if X[:, 13] <= 27.34000015258789 else to node 409.
									node=408 leaf node.
									node=409 test node: go to node 410 if X[:, 24] <= 1305357056.0 else to node 411.
										node=410 leaf node.
										node=411 leaf node.
								node=412 test node: go to node 413 if X[:, 35] <= 97.5 else to node 416.
									node=413 test node: go to node 414 if X[:, 1] <= 110.0 else to node 415.
										node=414 leaf node.
										node=415 leaf node.
									node=416 test node: go to node 417 if X[:, 19] <= 20098883584.0 else to node 418.
										node=417 leaf node.
										node=418 leaf node.
						node=419 test node: go to node 420 if X[:, 3] <= 14.5 else to node 431.
							node=420 test node: go to node 421 if X[:, 8] <= 1.5 else to node 428.
								node=421 test node: go to node 422 if X[:, 26] <= 7421147136.0 else to node 425.
									node=422 test node: go to node 423 if X[:, 19] <= 46971084800.0 else to node 424.
										node=423 leaf node.
										node=424 leaf node.
									node=425 test node: go to node 426 if X[:, 2] <= 6.0 else to node 427.
										node=426 leaf node.
										node=427 leaf node.
								node=428 test node: go to node 429 if X[:, 16] <= 6.420000076293945 else to node 430.
									node=429 leaf node.
									node=430 leaf node.
							node=431 test node: go to node 432 if X[:, 22] <= 73.95500183105469 else to node 435.
								node=432 test node: go to node 433 if X[:, 22] <= 64.98999786376953 else to node 434.
									node=433 leaf node.
									node=434 leaf node.
								node=435 leaf node.
				node=436 test node: go to node 437 if X[:, 13] <= 90.11500549316406 else to node 472.
					node=437 test node: go to node 438 if X[:, 8] <= 2.5 else to node 461.
						node=438 test node: go to node 439 if X[:, 14] <= 95.0050048828125 else to node 454.
							node=439 test node: go to node 440 if X[:, 35] <= 20.5 else to node 447.
								node=440 test node: go to node 441 if X[:, 15] <= 89.99000549316406 else to node 444.
									node=441 test node: go to node 442 if X[:, 12] <= 1.5 else to node 443.
										node=442 leaf node.
										node=443 leaf node.
									node=444 test node: go to node 445 if X[:, 7] <= 0.5 else to node 446.
										node=445 leaf node.
										node=446 leaf node.
								node=447 test node: go to node 448 if X[:, 0] <= 190307.0 else to node 451.
									node=448 test node: go to node 449 if X[:, 35] <= 79.0 else to node 450.
										node=449 leaf node.
										node=450 leaf node.
									node=451 test node: go to node 452 if X[:, 26] <= 2150684672.0 else to node 453.
										node=452 leaf node.
										node=453 leaf node.
							node=454 test node: go to node 455 if X[:, 15] <= 90.41000366210938 else to node 460.
								node=455 test node: go to node 456 if X[:, 19] <= 62674665472.0 else to node 459.
									node=456 test node: go to node 457 if X[:, 35] <= 12.5 else to node 458.
										node=457 leaf node.
										node=458 leaf node.
									node=459 leaf node.
								node=460 leaf node.
						node=461 test node: go to node 462 if X[:, 24] <= 1808288512.0 else to node 471.
							node=462 test node: go to node 463 if X[:, 35] <= 61.5 else to node 468.
								node=463 test node: go to node 464 if X[:, 14] <= 95.16000366210938 else to node 467.
									node=464 test node: go to node 465 if X[:, 27] <= 11.0 else to node 466.
										node=465 leaf node.
										node=466 leaf node.
									node=467 leaf node.
								node=468 test node: go to node 469 if X[:, 26] <= 1732839424.0 else to node 470.
									node=469 leaf node.
									node=470 leaf node.
							node=471 leaf node.
					node=472 test node: go to node 473 if X[:, 14] <= 95.13499450683594 else to node 482.
						node=473 test node: go to node 474 if X[:, 1] <= 26.5 else to node 475.
							node=474 leaf node.
							node=475 test node: go to node 476 if X[:, 16] <= 0.25 else to node 477.
								node=476 leaf node.
								node=477 test node: go to node 478 if X[:, 14] <= 36.61000061035156 else to node 479.
									node=478 leaf node.
									node=479 test node: go to node 480 if X[:, 2] <= 11.5 else to node 481.
										node=480 leaf node.
										node=481 leaf node.
						node=482 test node: go to node 483 if X[:, 3] <= 14.0 else to node 484.
							node=483 leaf node.
							node=484 leaf node.
			node=485 test node: go to node 486 if X[:, 4] <= 4.5 else to node 543.
				node=486 test node: go to node 487 if X[:, 26] <= 1309163008.0 else to node 516.
					node=487 test node: go to node 488 if X[:, 25] <= 154.0 else to node 509.
						node=488 test node: go to node 489 if X[:, 35] <= 7.0 else to node 504.
							node=489 test node: go to node 490 if X[:, 14] <= 95.58000183105469 else to node 497.
								node=490 test node: go to node 491 if X[:, 32] <= 124.5 else to node 494.
									node=491 test node: go to node 492 if X[:, 26] <= 107225424.0 else to node 493.
										node=492 leaf node.
										node=493 leaf node.
									node=494 test node: go to node 495 if X[:, 14] <= 89.48999786376953 else to node 496.
										node=495 leaf node.
										node=496 leaf node.
								node=497 test node: go to node 498 if X[:, 27] <= 7.5 else to node 501.
									node=498 test node: go to node 499 if X[:, 0] <= 76659.5 else to node 500.
										node=499 leaf node.
										node=500 leaf node.
									node=501 test node: go to node 502 if X[:, 14] <= 96.91500091552734 else to node 503.
										node=502 leaf node.
										node=503 leaf node.
							node=504 test node: go to node 505 if X[:, 13] <= 7.65500020980835 else to node 506.
								node=505 leaf node.
								node=506 test node: go to node 507 if X[:, 16] <= 27.30500030517578 else to node 508.
									node=507 leaf node.
									node=508 leaf node.
						node=509 test node: go to node 510 if X[:, 16] <= 39.51000213623047 else to node 515.
							node=510 test node: go to node 511 if X[:, 3] <= 11.0 else to node 514.
								node=511 test node: go to node 512 if X[:, 1] <= 1187.5 else to node 513.
									node=512 leaf node.
									node=513 leaf node.
								node=514 leaf node.
							node=515 leaf node.
					node=516 test node: go to node 517 if X[:, 24] <= 625124352.0 else to node 530.
						node=517 test node: go to node 518 if X[:, 34] <= 120.0 else to node 529.
							node=518 test node: go to node 519 if X[:, 1] <= 148.5 else to node 522.
								node=519 test node: go to node 520 if X[:, 13] <= 83.52000427246094 else to node 521.
									node=520 leaf node.
									node=521 leaf node.
								node=522 test node: go to node 523 if X[:, 25] <= 156.0 else to node 526.
									node=523 test node: go to node 524 if X[:, 16] <= 17.614999771118164 else to node 525.
										node=524 leaf node.
										node=525 leaf node.
									node=526 test node: go to node 527 if X[:, 27] <= 10.0 else to node 528.
										node=527 leaf node.
										node=528 leaf node.
							node=529 leaf node.
						node=530 test node: go to node 531 if X[:, 25] <= 151.5 else to node 538.
							node=531 test node: go to node 532 if X[:, 21] <= 2990.0 else to node 537.
								node=532 test node: go to node 533 if X[:, 0] <= 210718.5 else to node 536.
									node=533 test node: go to node 534 if X[:, 14] <= 89.93499755859375 else to node 535.
										node=534 leaf node.
										node=535 leaf node.
									node=536 leaf node.
								node=537 leaf node.
							node=538 test node: go to node 539 if X[:, 32] <= 29.0 else to node 540.
								node=539 leaf node.
								node=540 test node: go to node 541 if X[:, 25] <= 195.5 else to node 542.
									node=541 leaf node.
									node=542 leaf node.
				node=543 test node: go to node 544 if X[:, 24] <= 61800152.0 else to node 555.
					node=544 test node: go to node 545 if X[:, 26] <= 1292963328.0 else to node 546.
						node=545 leaf node.
						node=546 test node: go to node 547 if X[:, 16] <= 31.270000457763672 else to node 554.
							node=547 test node: go to node 548 if X[:, 35] <= 12.0 else to node 553.
								node=548 test node: go to node 549 if X[:, 15] <= 82.55000305175781 else to node 552.
									node=549 test node: go to node 550 if X[:, 24] <= 5228071.0 else to node 551.
										node=550 leaf node.
										node=551 leaf node.
									node=552 leaf node.
								node=553 leaf node.
							node=554 leaf node.
					node=555 test node: go to node 556 if X[:, 14] <= 96.91000366210938 else to node 573.
						node=556 test node: go to node 557 if X[:, 35] <= 102.5 else to node 572.
							node=557 test node: go to node 558 if X[:, 17] <= 48.04999923706055 else to node 565.
								node=558 test node: go to node 559 if X[:, 15] <= 90.47000122070312 else to node 562.
									node=559 test node: go to node 560 if X[:, 35] <= -2.5 else to node 561.
										node=560 leaf node.
										node=561 leaf node.
									node=562 test node: go to node 563 if X[:, 18] <= 37986.0 else to node 564.
										node=563 leaf node.
										node=564 leaf node.
								node=565 test node: go to node 566 if X[:, 0] <= 176429.5 else to node 569.
									node=566 test node: go to node 567 if X[:, 25] <= 151.0 else to node 568.
										node=567 leaf node.
										node=568 leaf node.
									node=569 test node: go to node 570 if X[:, 33] <= 1.5 else to node 571.
										node=570 leaf node.
										node=571 leaf node.
							node=572 leaf node.
						node=573 test node: go to node 574 if X[:, 32] <= 517.5 else to node 579.
							node=574 test node: go to node 575 if X[:, 21] <= 1.0 else to node 576.
								node=575 leaf node.
								node=576 test node: go to node 577 if X[:, 0] <= 181213.0 else to node 578.
									node=577 leaf node.
									node=578 leaf node.
							node=579 leaf node.
	node=580 test node: go to node 581 if X[:, 15] <= 90.0050048828125 else to node 882.
		node=581 test node: go to node 582 if X[:, 26] <= 2391119.5 else to node 725.
			node=582 test node: go to node 583 if X[:, 7] <= 2.5 else to node 694.
				node=583 test node: go to node 584 if X[:, 14] <= 96.22500610351562 else to node 637.
					node=584 test node: go to node 585 if X[:, 1] <= 1792.0 else to node 616.
						node=585 test node: go to node 586 if X[:, 35] <= 50.5 else to node 601.
							node=586 test node: go to node 587 if X[:, 20] <= 0.5 else to node 594.
								node=587 test node: go to node 588 if X[:, 4] <= 0.5 else to node 591.
									node=588 test node: go to node 589 if X[:, 1] <= 1510.5 else to node 590.
										node=589 leaf node.
										node=590 leaf node.
									node=591 test node: go to node 592 if X[:, 1] <= 1207.5 else to node 593.
										node=592 leaf node.
										node=593 leaf node.
								node=594 test node: go to node 595 if X[:, 4] <= 0.5 else to node 598.
									node=595 test node: go to node 596 if X[:, 1] <= 98.5 else to node 597.
										node=596 leaf node.
										node=597 leaf node.
									node=598 test node: go to node 599 if X[:, 8] <= 2.5 else to node 600.
										node=599 leaf node.
										node=600 leaf node.
							node=601 test node: go to node 602 if X[:, 27] <= 4.5 else to node 609.
								node=602 test node: go to node 603 if X[:, 24] <= 1414519552.0 else to node 606.
									node=603 test node: go to node 604 if X[:, 17] <= 91.09500122070312 else to node 605.
										node=604 leaf node.
										node=605 leaf node.
									node=606 test node: go to node 607 if X[:, 0] <= 113910.0 else to node 608.
										node=607 leaf node.
										node=608 leaf node.
								node=609 test node: go to node 610 if X[:, 6] <= 0.5 else to node 613.
									node=610 test node: go to node 611 if X[:, 15] <= 58.220001220703125 else to node 612.
										node=611 leaf node.
										node=612 leaf node.
									node=613 test node: go to node 614 if X[:, 15] <= 88.07499694824219 else to node 615.
										node=614 leaf node.
										node=615 leaf node.
						node=616 test node: go to node 617 if X[:, 13] <= 7.484999656677246 else to node 628.
							node=617 test node: go to node 618 if X[:, 16] <= 70.16000366210938 else to node 623.
								node=618 test node: go to node 619 if X[:, 16] <= 41.119998931884766 else to node 620.
									node=619 leaf node.
									node=620 test node: go to node 621 if X[:, 27] <= 6.5 else to node 622.
										node=621 leaf node.
										node=622 leaf node.
								node=623 test node: go to node 624 if X[:, 14] <= 69.29499816894531 else to node 625.
									node=624 leaf node.
									node=625 test node: go to node 626 if X[:, 21] <= 1591.5 else to node 627.
										node=626 leaf node.
										node=627 leaf node.
							node=628 test node: go to node 629 if X[:, 3] <= 3.5 else to node 630.
								node=629 leaf node.
								node=630 test node: go to node 631 if X[:, 23] <= 1254252288.0 else to node 634.
									node=631 test node: go to node 632 if X[:, 24] <= 815796928.0 else to node 633.
										node=632 leaf node.
										node=633 leaf node.
									node=634 test node: go to node 635 if X[:, 35] <= 83.0 else to node 636.
										node=635 leaf node.
										node=636 leaf node.
					node=637 test node: go to node 638 if X[:, 1] <= 1724.5 else to node 669.
						node=638 test node: go to node 639 if X[:, 21] <= 1221.5 else to node 654.
							node=639 test node: go to node 640 if X[:, 23] <= 727798592.0 else to node 647.
								node=640 test node: go to node 641 if X[:, 19] <= 11227895808.0 else to node 644.
									node=641 test node: go to node 642 if X[:, 31] <= 1.5 else to node 643.
										node=642 leaf node.
										node=643 leaf node.
									node=644 test node: go to node 645 if X[:, 14] <= 97.83499908447266 else to node 646.
										node=645 leaf node.
										node=646 leaf node.
								node=647 test node: go to node 648 if X[:, 24] <= 889575168.0 else to node 651.
									node=648 test node: go to node 649 if X[:, 6] <= 3.5 else to node 650.
										node=649 leaf node.
										node=650 leaf node.
									node=651 test node: go to node 652 if X[:, 14] <= 96.79499816894531 else to node 653.
										node=652 leaf node.
										node=653 leaf node.
							node=654 test node: go to node 655 if X[:, 27] <= 2.5 else to node 662.
								node=655 test node: go to node 656 if X[:, 32] <= 309.5 else to node 659.
									node=656 test node: go to node 657 if X[:, 24] <= 1528496896.0 else to node 658.
										node=657 leaf node.
										node=658 leaf node.
									node=659 test node: go to node 660 if X[:, 1] <= 133.0 else to node 661.
										node=660 leaf node.
										node=661 leaf node.
								node=662 test node: go to node 663 if X[:, 24] <= 1931621888.0 else to node 666.
									node=663 test node: go to node 664 if X[:, 15] <= 81.80500030517578 else to node 665.
										node=664 leaf node.
										node=665 leaf node.
									node=666 test node: go to node 667 if X[:, 14] <= 96.72999572753906 else to node 668.
										node=667 leaf node.
										node=668 leaf node.
						node=669 test node: go to node 670 if X[:, 19] <= 22465226752.0 else to node 679.
							node=670 test node: go to node 671 if X[:, 15] <= 53.5 else to node 672.
								node=671 leaf node.
								node=672 test node: go to node 673 if X[:, 13] <= 14.09000015258789 else to node 676.
									node=673 test node: go to node 674 if X[:, 3] <= 11.5 else to node 675.
										node=674 leaf node.
										node=675 leaf node.
									node=676 test node: go to node 677 if X[:, 14] <= 97.91999816894531 else to node 678.
										node=677 leaf node.
										node=678 leaf node.
							node=679 test node: go to node 680 if X[:, 24] <= 704335104.0 else to node 687.
								node=680 test node: go to node 681 if X[:, 14] <= 97.51499938964844 else to node 684.
									node=681 test node: go to node 682 if X[:, 14] <= 96.25999450683594 else to node 683.
										node=682 leaf node.
										node=683 leaf node.
									node=684 test node: go to node 685 if X[:, 21] <= 1184608.0 else to node 686.
										node=685 leaf node.
										node=686 leaf node.
								node=687 test node: go to node 688 if X[:, 32] <= 91.0 else to node 691.
									node=688 test node: go to node 689 if X[:, 16] <= 63.34000015258789 else to node 690.
										node=689 leaf node.
										node=690 leaf node.
									node=691 test node: go to node 692 if X[:, 35] <= 88.0 else to node 693.
										node=692 leaf node.
										node=693 leaf node.
				node=694 test node: go to node 695 if X[:, 18] <= 318910.0 else to node 722.
					node=695 test node: go to node 696 if X[:, 35] <= 23.0 else to node 715.
						node=696 test node: go to node 697 if X[:, 16] <= 29.860000610351562 else to node 708.
							node=697 test node: go to node 698 if X[:, 15] <= 43.88500213623047 else to node 703.
								node=698 test node: go to node 699 if X[:, 33] <= 1.5 else to node 702.
									node=699 test node: go to node 700 if X[:, 28] <= 9.0 else to node 701.
										node=700 leaf node.
										node=701 leaf node.
									node=702 leaf node.
								node=703 test node: go to node 704 if X[:, 17] <= 96.94000244140625 else to node 705.
									node=704 leaf node.
									node=705 test node: go to node 706 if X[:, 18] <= 31170.0 else to node 707.
										node=706 leaf node.
										node=707 leaf node.
							node=708 test node: go to node 709 if X[:, 13] <= 98.35499572753906 else to node 714.
								node=709 test node: go to node 710 if X[:, 32] <= 56.5 else to node 711.
									node=710 leaf node.
									node=711 test node: go to node 712 if X[:, 35] <= -2.0 else to node 713.
										node=712 leaf node.
										node=713 leaf node.
								node=714 leaf node.
						node=715 test node: go to node 716 if X[:, 16] <= 21.43000030517578 else to node 717.
							node=716 leaf node.
							node=717 test node: go to node 718 if X[:, 18] <= 43208.0 else to node 721.
								node=718 test node: go to node 719 if X[:, 4] <= 2.5 else to node 720.
									node=719 leaf node.
									node=720 leaf node.
								node=721 leaf node.
					node=722 test node: go to node 723 if X[:, 14] <= 95.48999786376953 else to node 724.
						node=723 leaf node.
						node=724 leaf node.
			node=725 test node: go to node 726 if X[:, 16] <= 69.99000549316406 else to node 815.
				node=726 test node: go to node 727 if X[:, 25] <= 150.5 else to node 774.
					node=727 test node: go to node 728 if X[:, 14] <= 95.0050048828125 else to node 753.
						node=728 test node: go to node 729 if X[:, 13] <= 90.10499572753906 else to node 742.
							node=729 test node: go to node 730 if X[:, 7] <= 2.5 else to node 737.
								node=730 test node: go to node 731 if X[:, 35] <= 17.0 else to node 734.
									node=731 test node: go to node 732 if X[:, 26] <= 6369214464.0 else to node 733.
										node=732 leaf node.
										node=733 leaf node.
									node=734 test node: go to node 735 if X[:, 14] <= 77.63999938964844 else to node 736.
										node=735 leaf node.
										node=736 leaf node.
								node=737 test node: go to node 738 if X[:, 32] <= 175.0 else to node 739.
									node=738 leaf node.
									node=739 test node: go to node 740 if X[:, 24] <= 456417984.0 else to node 741.
										node=740 leaf node.
										node=741 leaf node.
							node=742 test node: go to node 743 if X[:, 28] <= 5.5 else to node 748.
								node=743 test node: go to node 744 if X[:, 35] <= 0.5 else to node 747.
									node=744 test node: go to node 745 if X[:, 3] <= 8.0 else to node 746.
										node=745 leaf node.
										node=746 leaf node.
									node=747 leaf node.
								node=748 test node: go to node 749 if X[:, 30] <= 1.5 else to node 752.
									node=749 test node: go to node 750 if X[:, 4] <= 14.0 else to node 751.
										node=750 leaf node.
										node=751 leaf node.
									node=752 leaf node.
						node=753 test node: go to node 754 if X[:, 8] <= 1.5 else to node 767.
							node=754 test node: go to node 755 if X[:, 35] <= 20.5 else to node 762.
								node=755 test node: go to node 756 if X[:, 35] <= -0.5 else to node 759.
									node=756 test node: go to node 757 if X[:, 24] <= 1093317248.0 else to node 758.
										node=757 leaf node.
										node=758 leaf node.
									node=759 test node: go to node 760 if X[:, 6] <= 4.5 else to node 761.
										node=760 leaf node.
										node=761 leaf node.
								node=762 test node: go to node 763 if X[:, 7] <= 2.0 else to node 766.
									node=763 test node: go to node 764 if X[:, 22] <= 30.770000457763672 else to node 765.
										node=764 leaf node.
										node=765 leaf node.
									node=766 leaf node.
							node=767 test node: go to node 768 if X[:, 15] <= 67.04499816894531 else to node 771.
								node=768 test node: go to node 769 if X[:, 27] <= 6.5 else to node 770.
									node=769 leaf node.
									node=770 leaf node.
								node=771 test node: go to node 772 if X[:, 32] <= 20.0 else to node 773.
									node=772 leaf node.
									node=773 leaf node.
					node=774 test node: go to node 775 if X[:, 14] <= 94.9949951171875 else to node 800.
						node=775 test node: go to node 776 if X[:, 13] <= 89.93499755859375 else to node 791.
							node=776 test node: go to node 777 if X[:, 7] <= 1.5 else to node 784.
								node=777 test node: go to node 778 if X[:, 8] <= 2.5 else to node 781.
									node=778 test node: go to node 779 if X[:, 24] <= 2126732800.0 else to node 780.
										node=779 leaf node.
										node=780 leaf node.
									node=781 test node: go to node 782 if X[:, 29] <= 4.5 else to node 783.
										node=782 leaf node.
										node=783 leaf node.
								node=784 test node: go to node 785 if X[:, 8] <= 0.5 else to node 788.
									node=785 test node: go to node 786 if X[:, 17] <= 94.54499816894531 else to node 787.
										node=786 leaf node.
										node=787 leaf node.
									node=788 test node: go to node 789 if X[:, 32] <= 79.5 else to node 790.
										node=789 leaf node.
										node=790 leaf node.
							node=791 test node: go to node 792 if X[:, 9] <= 1.5 else to node 797.
								node=792 test node: go to node 793 if X[:, 35] <= 18.0 else to node 796.
									node=793 test node: go to node 794 if X[:, 29] <= 7.0 else to node 795.
										node=794 leaf node.
										node=795 leaf node.
									node=796 leaf node.
								node=797 test node: go to node 798 if X[:, 15] <= 85.59500122070312 else to node 799.
									node=798 leaf node.
									node=799 leaf node.
						node=800 test node: go to node 801 if X[:, 13] <= 90.19000244140625 else to node 814.
							node=801 test node: go to node 802 if X[:, 7] <= 2.5 else to node 809.
								node=802 test node: go to node 803 if X[:, 9] <= 1.5 else to node 806.
									node=803 test node: go to node 804 if X[:, 22] <= 74.8800048828125 else to node 805.
										node=804 leaf node.
										node=805 leaf node.
									node=806 test node: go to node 807 if X[:, 8] <= 2.5 else to node 808.
										node=807 leaf node.
										node=808 leaf node.
								node=809 test node: go to node 810 if X[:, 22] <= 86.29000091552734 else to node 813.
									node=810 test node: go to node 811 if X[:, 16] <= 8.454999923706055 else to node 812.
										node=811 leaf node.
										node=812 leaf node.
									node=813 leaf node.
							node=814 leaf node.
				node=815 test node: go to node 816 if X[:, 25] <= 150.5 else to node 843.
					node=816 test node: go to node 817 if X[:, 14] <= 94.94499969482422 else to node 840.
						node=817 test node: go to node 818 if X[:, 35] <= 22.5 else to node 831.
							node=818 test node: go to node 819 if X[:, 9] <= 2.5 else to node 826.
								node=819 test node: go to node 820 if X[:, 1] <= 2809.5 else to node 823.
									node=820 test node: go to node 821 if X[:, 35] <= -1.5 else to node 822.
										node=821 leaf node.
										node=822 leaf node.
									node=823 test node: go to node 824 if X[:, 17] <= 93.36000061035156 else to node 825.
										node=824 leaf node.
										node=825 leaf node.
								node=826 test node: go to node 827 if X[:, 15] <= 72.43000030517578 else to node 830.
									node=827 test node: go to node 828 if X[:, 3] <= 10.0 else to node 829.
										node=828 leaf node.
										node=829 leaf node.
									node=830 leaf node.
							node=831 test node: go to node 832 if X[:, 31] <= 1.5 else to node 839.
								node=832 test node: go to node 833 if X[:, 18] <= 5128.5 else to node 836.
									node=833 test node: go to node 834 if X[:, 15] <= 85.44000244140625 else to node 835.
										node=834 leaf node.
										node=835 leaf node.
									node=836 test node: go to node 837 if X[:, 32] <= 408.5 else to node 838.
										node=837 leaf node.
										node=838 leaf node.
								node=839 leaf node.
						node=840 test node: go to node 841 if X[:, 7] <= 1.5 else to node 842.
							node=841 leaf node.
							node=842 leaf node.
					node=843 test node: go to node 844 if X[:, 8] <= 1.5 else to node 869.
						node=844 test node: go to node 845 if X[:, 23] <= 1851603072.0 else to node 860.
							node=845 test node: go to node 846 if X[:, 14] <= 95.01499938964844 else to node 853.
								node=846 test node: go to node 847 if X[:, 13] <= 90.11000061035156 else to node 850.
									node=847 test node: go to node 848 if X[:, 28] <= 7.5 else to node 849.
										node=848 leaf node.
										node=849 leaf node.
									node=850 test node: go to node 851 if X[:, 16] <= 81.88500213623047 else to node 852.
										node=851 leaf node.
										node=852 leaf node.
								node=853 test node: go to node 854 if X[:, 7] <= 2.0 else to node 857.
									node=854 test node: go to node 855 if X[:, 32] <= 511.5 else to node 856.
										node=855 leaf node.
										node=856 leaf node.
									node=857 test node: go to node 858 if X[:, 21] <= 26473.5 else to node 859.
										node=858 leaf node.
										node=859 leaf node.
							node=860 test node: go to node 861 if X[:, 1] <= 369.5 else to node 866.
								node=861 test node: go to node 862 if X[:, 17] <= 95.30000305175781 else to node 865.
									node=862 test node: go to node 863 if X[:, 13] <= 68.29499816894531 else to node 864.
										node=863 leaf node.
										node=864 leaf node.
									node=865 leaf node.
								node=866 test node: go to node 867 if X[:, 22] <= 86.05000305175781 else to node 868.
									node=867 leaf node.
									node=868 leaf node.
						node=869 test node: go to node 870 if X[:, 4] <= 2.0 else to node 875.
							node=870 test node: go to node 871 if X[:, 1] <= 185.0 else to node 872.
								node=871 leaf node.
								node=872 test node: go to node 873 if X[:, 14] <= 79.73999786376953 else to node 874.
									node=873 leaf node.
									node=874 leaf node.
							node=875 test node: go to node 876 if X[:, 16] <= 84.46499633789062 else to node 881.
								node=876 test node: go to node 877 if X[:, 18] <= 133452.0 else to node 880.
									node=877 test node: go to node 878 if X[:, 25] <= 183.5 else to node 879.
										node=878 leaf node.
										node=879 leaf node.
									node=880 leaf node.
								node=881 leaf node.
		node=882 test node: go to node 883 if X[:, 25] <= 150.5 else to node 1076.
			node=883 test node: go to node 884 if X[:, 16] <= 69.93499755859375 else to node 979.
				node=884 test node: go to node 885 if X[:, 25] <= 4.0 else to node 940.
					node=885 test node: go to node 886 if X[:, 14] <= 95.55500030517578 else to node 913.
						node=886 test node: go to node 887 if X[:, 1] <= 719.5 else to node 902.
							node=887 test node: go to node 888 if X[:, 19] <= 21388382208.0 else to node 895.
								node=888 test node: go to node 889 if X[:, 22] <= 74.27000427246094 else to node 892.
									node=889 test node: go to node 890 if X[:, 23] <= 200641440.0 else to node 891.
										node=890 leaf node.
										node=891 leaf node.
									node=892 test node: go to node 893 if X[:, 14] <= 92.25 else to node 894.
										node=893 leaf node.
										node=894 leaf node.
								node=895 test node: go to node 896 if X[:, 7] <= 2.5 else to node 899.
									node=896 test node: go to node 897 if X[:, 8] <= 2.0 else to node 898.
										node=897 leaf node.
										node=898 leaf node.
									node=899 test node: go to node 900 if X[:, 2] <= 2.5 else to node 901.
										node=900 leaf node.
										node=901 leaf node.
							node=902 test node: go to node 903 if X[:, 1] <= 769.0 else to node 906.
								node=903 test node: go to node 904 if X[:, 0] <= 117610.5 else to node 905.
									node=904 leaf node.
									node=905 leaf node.
								node=906 test node: go to node 907 if X[:, 21] <= 29.5 else to node 910.
									node=907 test node: go to node 908 if X[:, 3] <= 7.5 else to node 909.
										node=908 leaf node.
										node=909 leaf node.
									node=910 test node: go to node 911 if X[:, 23] <= 1010902912.0 else to node 912.
										node=911 leaf node.
										node=912 leaf node.
						node=913 test node: go to node 914 if X[:, 21] <= 54495.0 else to node 927.
							node=914 test node: go to node 915 if X[:, 1] <= 52.5 else to node 920.
								node=915 test node: go to node 916 if X[:, 28] <= 7.5 else to node 917.
									node=916 leaf node.
									node=917 test node: go to node 918 if X[:, 23] <= 231629568.0 else to node 919.
										node=918 leaf node.
										node=919 leaf node.
								node=920 test node: go to node 921 if X[:, 23] <= 1119610112.0 else to node 924.
									node=921 test node: go to node 922 if X[:, 22] <= 52.20500183105469 else to node 923.
										node=922 leaf node.
										node=923 leaf node.
									node=924 test node: go to node 925 if X[:, 22] <= 64.1449966430664 else to node 926.
										node=925 leaf node.
										node=926 leaf node.
							node=927 test node: go to node 928 if X[:, 1] <= 1806.0 else to node 935.
								node=928 test node: go to node 929 if X[:, 9] <= 0.5 else to node 932.
									node=929 test node: go to node 930 if X[:, 24] <= 934176768.0 else to node 931.
										node=930 leaf node.
										node=931 leaf node.
									node=932 test node: go to node 933 if X[:, 0] <= 71347.5 else to node 934.
										node=933 leaf node.
										node=934 leaf node.
								node=935 test node: go to node 936 if X[:, 23] <= 556606016.0 else to node 937.
									node=936 leaf node.
									node=937 test node: go to node 938 if X[:, 14] <= 96.9949951171875 else to node 939.
										node=938 leaf node.
										node=939 leaf node.
					node=940 test node: go to node 941 if X[:, 14] <= 94.9749984741211 else to node 962.
						node=941 test node: go to node 942 if X[:, 9] <= 2.5 else to node 955.
							node=942 test node: go to node 943 if X[:, 13] <= 90.24500274658203 else to node 950.
								node=943 test node: go to node 944 if X[:, 35] <= 12.5 else to node 947.
									node=944 test node: go to node 945 if X[:, 7] <= 1.5 else to node 946.
										node=945 leaf node.
										node=946 leaf node.
									node=947 test node: go to node 948 if X[:, 26] <= 5688552448.0 else to node 949.
										node=948 leaf node.
										node=949 leaf node.
								node=950 test node: go to node 951 if X[:, 26] <= 222318624.0 else to node 952.
									node=951 leaf node.
									node=952 test node: go to node 953 if X[:, 35] <= 17.5 else to node 954.
										node=953 leaf node.
										node=954 leaf node.
							node=955 test node: go to node 956 if X[:, 14] <= 54.18000030517578 else to node 957.
								node=956 leaf node.
								node=957 test node: go to node 958 if X[:, 8] <= 0.5 else to node 959.
									node=958 leaf node.
									node=959 test node: go to node 960 if X[:, 19] <= 6397102080.0 else to node 961.
										node=960 leaf node.
										node=961 leaf node.
						node=962 test node: go to node 963 if X[:, 9] <= 2.5 else to node 974.
							node=963 test node: go to node 964 if X[:, 34] <= 347.0 else to node 969.
								node=964 test node: go to node 965 if X[:, 13] <= 86.72999572753906 else to node 968.
									node=965 test node: go to node 966 if X[:, 35] <= 6.0 else to node 967.
										node=966 leaf node.
										node=967 leaf node.
									node=968 leaf node.
								node=969 test node: go to node 970 if X[:, 27] <= 3.5 else to node 971.
									node=970 leaf node.
									node=971 test node: go to node 972 if X[:, 21] <= 392.0 else to node 973.
										node=972 leaf node.
										node=973 leaf node.
							node=974 test node: go to node 975 if X[:, 34] <= 39.0 else to node 978.
								node=975 test node: go to node 976 if X[:, 26] <= 233811904.0 else to node 977.
									node=976 leaf node.
									node=977 leaf node.
								node=978 leaf node.
				node=979 test node: go to node 980 if X[:, 4] <= 0.5 else to node 1027.
					node=980 test node: go to node 981 if X[:, 26] <= 16707618.0 else to node 1004.
						node=981 test node: go to node 982 if X[:, 1] <= 1824.0 else to node 997.
							node=982 test node: go to node 983 if X[:, 14] <= 96.05999755859375 else to node 990.
								node=983 test node: go to node 984 if X[:, 1] <= 211.0 else to node 987.
									node=984 test node: go to node 985 if X[:, 13] <= 83.28500366210938 else to node 986.
										node=985 leaf node.
										node=986 leaf node.
									node=987 test node: go to node 988 if X[:, 35] <= 56.5 else to node 989.
										node=988 leaf node.
										node=989 leaf node.
								node=990 test node: go to node 991 if X[:, 22] <= 57.105003356933594 else to node 994.
									node=991 test node: go to node 992 if X[:, 24] <= 802494720.0 else to node 993.
										node=992 leaf node.
										node=993 leaf node.
									node=994 test node: go to node 995 if X[:, 14] <= 96.44499969482422 else to node 996.
										node=995 leaf node.
										node=996 leaf node.
							node=997 test node: go to node 998 if X[:, 19] <= 96762904576.0 else to node 1003.
								node=998 test node: go to node 999 if X[:, 22] <= 49.55500030517578 else to node 1002.
									node=999 test node: go to node 1000 if X[:, 24] <= 509284352.0 else to node 1001.
										node=1000 leaf node.
										node=1001 leaf node.
									node=1002 leaf node.
								node=1003 leaf node.
						node=1004 test node: go to node 1005 if X[:, 9] <= 1.5 else to node 1018.
							node=1005 test node: go to node 1006 if X[:, 35] <= 7.5 else to node 1013.
								node=1006 test node: go to node 1007 if X[:, 13] <= 89.23999786376953 else to node 1010.
									node=1007 test node: go to node 1008 if X[:, 16] <= 85.04000091552734 else to node 1009.
										node=1008 leaf node.
										node=1009 leaf node.
									node=1010 test node: go to node 1011 if X[:, 24] <= 643133440.0 else to node 1012.
										node=1011 leaf node.
										node=1012 leaf node.
								node=1013 test node: go to node 1014 if X[:, 14] <= 91.69000244140625 else to node 1017.
									node=1014 test node: go to node 1015 if X[:, 16] <= 70.27999877929688 else to node 1016.
										node=1015 leaf node.
										node=1016 leaf node.
									node=1017 leaf node.
							node=1018 test node: go to node 1019 if X[:, 14] <= 91.2750015258789 else to node 1022.
								node=1019 test node: go to node 1020 if X[:, 35] <= -8.5 else to node 1021.
									node=1020 leaf node.
									node=1021 leaf node.
								node=1022 test node: go to node 1023 if X[:, 7] <= 0.5 else to node 1024.
									node=1023 leaf node.
									node=1024 test node: go to node 1025 if X[:, 14] <= 91.30000305175781 else to node 1026.
										node=1025 leaf node.
										node=1026 leaf node.
					node=1027 test node: go to node 1028 if X[:, 25] <= 19.0 else to node 1059.
						node=1028 test node: go to node 1029 if X[:, 24] <= 1729306880.0 else to node 1044.
							node=1029 test node: go to node 1030 if X[:, 32] <= 4.5 else to node 1037.
								node=1030 test node: go to node 1031 if X[:, 2] <= 7.5 else to node 1034.
									node=1031 test node: go to node 1032 if X[:, 24] <= 947450176.0 else to node 1033.
										node=1032 leaf node.
										node=1033 leaf node.
									node=1034 test node: go to node 1035 if X[:, 3] <= 13.0 else to node 1036.
										node=1035 leaf node.
										node=1036 leaf node.
								node=1037 test node: go to node 1038 if X[:, 1] <= 1827.0 else to node 1041.
									node=1038 test node: go to node 1039 if X[:, 35] <= 73.0 else to node 1040.
										node=1039 leaf node.
										node=1040 leaf node.
									node=1041 test node: go to node 1042 if X[:, 14] <= 96.30000305175781 else to node 1043.
										node=1042 leaf node.
										node=1043 leaf node.
							node=1044 test node: go to node 1045 if X[:, 14] <= 96.91000366210938 else to node 1052.
								node=1045 test node: go to node 1046 if X[:, 15] <= 93.80500030517578 else to node 1049.
									node=1046 test node: go to node 1047 if X[:, 16] <= 73.43000030517578 else to node 1048.
										node=1047 leaf node.
										node=1048 leaf node.
									node=1049 test node: go to node 1050 if X[:, 21] <= 5756.0 else to node 1051.
										node=1050 leaf node.
										node=1051 leaf node.
								node=1052 test node: go to node 1053 if X[:, 32] <= 219.0 else to node 1056.
									node=1053 test node: go to node 1054 if X[:, 22] <= 71.20500183105469 else to node 1055.
										node=1054 leaf node.
										node=1055 leaf node.
									node=1056 test node: go to node 1057 if X[:, 24] <= 1806069888.0 else to node 1058.
										node=1057 leaf node.
										node=1058 leaf node.
						node=1059 test node: go to node 1060 if X[:, 7] <= 1.5 else to node 1069.
							node=1060 test node: go to node 1061 if X[:, 13] <= 88.72999572753906 else to node 1066.
								node=1061 test node: go to node 1062 if X[:, 3] <= 15.5 else to node 1065.
									node=1062 test node: go to node 1063 if X[:, 34] <= 116.0 else to node 1064.
										node=1063 leaf node.
										node=1064 leaf node.
									node=1065 leaf node.
								node=1066 test node: go to node 1067 if X[:, 22] <= 35.27000045776367 else to node 1068.
									node=1067 leaf node.
									node=1068 leaf node.
							node=1069 test node: go to node 1070 if X[:, 14] <= 69.22000122070312 else to node 1071.
								node=1070 leaf node.
								node=1071 test node: go to node 1072 if X[:, 4] <= 21.5 else to node 1073.
									node=1072 leaf node.
									node=1073 test node: go to node 1074 if X[:, 19] <= 2502515968.0 else to node 1075.
										node=1074 leaf node.
										node=1075 leaf node.
			node=1076 test node: go to node 1077 if X[:, 35] <= 47.0 else to node 1154.
				node=1077 test node: go to node 1078 if X[:, 8] <= 2.5 else to node 1123.
					node=1078 test node: go to node 1079 if X[:, 14] <= 94.9949951171875 else to node 1108.
						node=1079 test node: go to node 1080 if X[:, 13] <= 90.0999984741211 else to node 1095.
							node=1080 test node: go to node 1081 if X[:, 18] <= 14990.5 else to node 1088.
								node=1081 test node: go to node 1082 if X[:, 16] <= 70.00999450683594 else to node 1085.
									node=1082 test node: go to node 1083 if X[:, 35] <= -2.0 else to node 1084.
										node=1083 leaf node.
										node=1084 leaf node.
									node=1085 test node: go to node 1086 if X[:, 26] <= 653274496.0 else to node 1087.
										node=1086 leaf node.
										node=1087 leaf node.
								node=1088 test node: go to node 1089 if X[:, 16] <= 70.0050048828125 else to node 1092.
									node=1089 test node: go to node 1090 if X[:, 16] <= 4.815000057220459 else to node 1091.
										node=1090 leaf node.
										node=1091 leaf node.
									node=1092 test node: go to node 1093 if X[:, 7] <= 1.5 else to node 1094.
										node=1093 leaf node.
										node=1094 leaf node.
							node=1095 test node: go to node 1096 if X[:, 23] <= 715575296.0 else to node 1101.
								node=1096 test node: go to node 1097 if X[:, 14] <= 92.76499938964844 else to node 1100.
									node=1097 test node: go to node 1098 if X[:, 35] <= 16.0 else to node 1099.
										node=1098 leaf node.
										node=1099 leaf node.
									node=1100 leaf node.
								node=1101 test node: go to node 1102 if X[:, 16] <= 69.92500305175781 else to node 1105.
									node=1102 test node: go to node 1103 if X[:, 4] <= 12.5 else to node 1104.
										node=1103 leaf node.
										node=1104 leaf node.
									node=1105 test node: go to node 1106 if X[:, 8] <= 1.5 else to node 1107.
										node=1106 leaf node.
										node=1107 leaf node.
						node=1108 test node: go to node 1109 if X[:, 16] <= 69.4949951171875 else to node 1120.
							node=1109 test node: go to node 1110 if X[:, 35] <= -35.0 else to node 1115.
								node=1110 test node: go to node 1111 if X[:, 19] <= 1573767680.0 else to node 1112.
									node=1111 leaf node.
									node=1112 test node: go to node 1113 if X[:, 35] <= -58.0 else to node 1114.
										node=1113 leaf node.
										node=1114 leaf node.
								node=1115 test node: go to node 1116 if X[:, 13] <= 89.79000091552734 else to node 1119.
									node=1116 test node: go to node 1117 if X[:, 14] <= 97.04499816894531 else to node 1118.
										node=1117 leaf node.
										node=1118 leaf node.
									node=1119 leaf node.
							node=1120 test node: go to node 1121 if X[:, 14] <= 98.89500427246094 else to node 1122.
								node=1121 leaf node.
								node=1122 leaf node.
					node=1123 test node: go to node 1124 if X[:, 15] <= 93.97000122070312 else to node 1141.
						node=1124 test node: go to node 1125 if X[:, 14] <= 95.11500549316406 else to node 1136.
							node=1125 test node: go to node 1126 if X[:, 13] <= 88.83500671386719 else to node 1133.
								node=1126 test node: go to node 1127 if X[:, 17] <= 94.61500549316406 else to node 1130.
									node=1127 test node: go to node 1128 if X[:, 35] <= -0.5 else to node 1129.
										node=1128 leaf node.
										node=1129 leaf node.
									node=1130 test node: go to node 1131 if X[:, 8] <= 7.5 else to node 1132.
										node=1131 leaf node.
										node=1132 leaf node.
								node=1133 test node: go to node 1134 if X[:, 26] <= 1286430208.0 else to node 1135.
									node=1134 leaf node.
									node=1135 leaf node.
							node=1136 test node: go to node 1137 if X[:, 27] <= 6.5 else to node 1140.
								node=1137 test node: go to node 1138 if X[:, 23] <= 1115440256.0 else to node 1139.
									node=1138 leaf node.
									node=1139 leaf node.
								node=1140 leaf node.
						node=1141 test node: go to node 1142 if X[:, 14] <= 94.38999938964844 else to node 1151.
							node=1142 test node: go to node 1143 if X[:, 13] <= 77.3499984741211 else to node 1150.
								node=1143 test node: go to node 1144 if X[:, 18] <= 25309.0 else to node 1147.
									node=1144 test node: go to node 1145 if X[:, 30] <= 0.5 else to node 1146.
										node=1145 leaf node.
										node=1146 leaf node.
									node=1147 test node: go to node 1148 if X[:, 3] <= 8.0 else to node 1149.
										node=1148 leaf node.
										node=1149 leaf node.
								node=1150 leaf node.
							node=1151 test node: go to node 1152 if X[:, 2] <= 7.0 else to node 1153.
								node=1152 leaf node.
								node=1153 leaf node.
				node=1154 test node: go to node 1155 if X[:, 16] <= 70.23500061035156 else to node 1178.
					node=1155 test node: go to node 1156 if X[:, 14] <= 94.87999725341797 else to node 1171.
						node=1156 test node: go to node 1157 if X[:, 26] <= 262165889024.0 else to node 1170.
							node=1157 test node: go to node 1158 if X[:, 1] <= 1689.5 else to node 1165.
								node=1158 test node: go to node 1159 if X[:, 24] <= 1485682432.0 else to node 1162.
									node=1159 test node: go to node 1160 if X[:, 16] <= 69.93499755859375 else to node 1161.
										node=1160 leaf node.
										node=1161 leaf node.
									node=1162 test node: go to node 1163 if X[:, 19] <= 11641120768.0 else to node 1164.
										node=1163 leaf node.
										node=1164 leaf node.
								node=1165 test node: go to node 1166 if X[:, 9] <= 2.0 else to node 1169.
									node=1166 test node: go to node 1167 if X[:, 0] <= 195553.0 else to node 1168.
										node=1167 leaf node.
										node=1168 leaf node.
									node=1169 leaf node.
							node=1170 leaf node.
						node=1171 test node: go to node 1172 if X[:, 8] <= 2.0 else to node 1175.
							node=1172 test node: go to node 1173 if X[:, 34] <= 5.0 else to node 1174.
								node=1173 leaf node.
								node=1174 leaf node.
							node=1175 test node: go to node 1176 if X[:, 34] <= 2.5 else to node 1177.
								node=1176 leaf node.
								node=1177 leaf node.
					node=1178 test node: go to node 1179 if X[:, 8] <= 1.5 else to node 1188.
						node=1179 test node: go to node 1180 if X[:, 34] <= 6085.5 else to node 1187.
							node=1180 test node: go to node 1181 if X[:, 14] <= 95.56500244140625 else to node 1186.
								node=1181 test node: go to node 1182 if X[:, 30] <= 6.5 else to node 1183.
									node=1182 leaf node.
									node=1183 test node: go to node 1184 if X[:, 21] <= 146.5 else to node 1185.
										node=1184 leaf node.
										node=1185 leaf node.
								node=1186 leaf node.
							node=1187 leaf node.
						node=1188 test node: go to node 1189 if X[:, 14] <= 91.1199951171875 else to node 1192.
							node=1189 test node: go to node 1190 if X[:, 27] <= 9.5 else to node 1191.
								node=1190 leaf node.
								node=1191 leaf node.
							node=1192 test node: go to node 1193 if X[:, 35] <= 151.5 else to node 1194.
								node=1193 leaf node.
								node=1194 leaf node.

Model Interpretation


3.1 Interpretable Models

3.1.1 Decision Rule

Rule-based systems are designed by defining specific rules that describe an anomaly. A decision rule is a simple IF-THEN statement consisting of a condition and a prediction. A single decision rule or a combination of several rules can be used to make predictions. Such rules are typically based on the experience of industry experts and are well suited to detecting "known anomalies", that is, cases where we already recognize what is normal and what is not.

Decision rules follow a general structure: IF the conditions are met THEN make a certain prediction:

  • The condition is a conjunction of attribute tests:
    (A1 = v1) and (A2 = v2) and ... and (An = vn)
  • The prediction is the class label

Quality of a classification rule can be evaluated by:

  • Support or coverage of a rule: The percentage of instances to which the condition of a rule applies is called the support.
  • Accuracy or confidence of a rule: The fraction of the instances covered by the condition of the rule for which the rule predicts the correct class.

Usually there is a trade-off between accuracy and support: by adding more features to the condition, we can achieve higher accuracy, but lose support.
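
Support and accuracy, and the trade-off between them, can be illustrated with a minimal sketch. The helper `rule_quality`, the toy arrays and the use of threshold tests of the form `X[:, i] <= t` (instead of the equality tests shown above) are assumptions made for illustration only; they are not taken from the challenge dataset.

```python
import numpy as np

def rule_quality(X, y, conditions, predicted_label):
    """Return (support, accuracy) of one IF-THEN rule.

    conditions: list of (feature_index, threshold) pairs; the rule fires
    only when X[:, feature_index] <= threshold holds for every pair.
    """
    covered = np.ones(len(X), dtype=bool)
    for idx, thr in conditions:
        covered &= X[:, idx] <= thr
    # Support: fraction of all instances the condition applies to.
    support = covered.mean()
    # Accuracy: fraction of covered instances with the predicted label.
    accuracy = (y[covered] == predicted_label).mean() if covered.any() else float("nan")
    return support, accuracy

# Toy data (hypothetical): two features, binary label.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y = np.array([1, 1, 0, 0])

# IF X[:, 0] <= 3.0 THEN class 1
print(rule_quality(X, y, [(0, 3.0)], 1))
# IF X[:, 0] <= 3.0 AND X[:, 1] <= 20.0 THEN class 1
print(rule_quality(X, y, [(0, 3.0), (1, 20.0)], 1))
```

On this toy data the second, more specific rule reaches a higher accuracy than the first one, but it covers fewer instances.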

3.1.2 Advantages and Disadvantages

  • Advantages:

    • The main advantage is ease of interpretation: a human can understand how the model makes predictions and whether they make sense. For a specific instance, it is possible to verify that the process worked correctly, and see what the main factors in the prediction were.
    • Decision rules can be as expressive as decision trees while being more compact. Decision trees often also suffer from replicated sub-trees, that is when the splits in a left and a right child node have the same structure.
    • The prediction with IF-THEN rules is fast since only a few binary statements need to be checked to determine which rules apply.
    • Decision rules are robust against monotonous transformations of the input features because only the threshold in the conditions changes. They are also robust against outliers since it only matters if a condition applies or not.
    • IF-THEN rules usually generate sparse models, which means that not many features are included. They select only the relevant features for the model. For example, a linear model assigns a weight to every input feature by default. Features that are irrelevant can simply be ignored by IF-THEN rules.
  • Disadvantages:

    • The research and literature on IF-THEN rules focus on classification and almost completely neglect regression.
    • Many of the older rule-learning algorithms are prone to overfitting.
    • Decision rules are poor at describing linear relationships between features and output.
    • Rule learning can be memory- and computation-intensive.

3.1.3 Interpretable Models

There are many ways to learn rules from data. Some of them are:

  • OneR: learns rules from a single feature. OneR is characterized by its simplicity, interpretability and its use as a benchmark.
  • Sequential Covering: is a general procedure that iteratively learns rules and removes the data points that are covered by the new rule. This procedure is used by many rule learning algorithms.
  • Bayesian Rule Lists: combine pre-mined frequent patterns into a decision list using Bayesian statistics. Using pre-mined patterns is a common approach used by many rule learning algorithms.
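
To make the simplest of these concrete, here is a rough sketch of OneR on categorical features. The function name `one_r`, the boolean features `cpu_high` and `disk_high`, and the records are assumptions made for illustration. A real OneR implementation would also discretise numeric features first.

```python
from collections import Counter, defaultdict


def one_r(rows, features, label):
    """Return (feature, value -> predicted class, training errors) for the best single feature."""
    best = None
    for feature in features:
        # Count the class distribution for every value of this feature.
        counts = defaultdict(Counter)
        for r in rows:
            counts[r[feature]][r[label]] += 1
        # Each value predicts its majority class.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors are the instances that do not belong to the majority class.
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:
            best = (feature, rules, errors)
    return best


# Made-up, already discretised records.
rows = [
    {"cpu_high": True, "disk_high": True, "Check1": 1},
    {"cpu_high": True, "disk_high": False, "Check1": 1},
    {"cpu_high": False, "disk_high": True, "Check1": 0},
    {"cpu_high": False, "disk_high": False, "Check1": 0},
]

feature, rules, errors = one_r(rows, ["cpu_high", "disk_high"], "Check1")
```

On this toy data the selected feature is `cpu_high`, which separates the two classes without error, while `disk_high` would misclassify two records.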

3.1.4 RIPPER by WEKA

In this experiment we study RIPPER, which is a variant of the sequential covering algorithm. We installed the Weka application to run the experiment; in Weka, the RIPPER implementation is called JRip. It is based on incremental reduced error pruning (IREP). The main idea of the sequential covering algorithm is as follows. Find a good rule that applies to some of the data points, that is, a rule that covers many examples of one class and none or very few of the other classes. Remove all data points that are covered by the rule. Repeat the rule learning and the removal of covered points on the remaining points until no points are left or another stop condition is met. The result is a decision list.

The stop conditions:

  • When the rule is perfect, i.e. accuracy = 1
  • When increase in accuracy gets below a given threshold
  • When the training set cannot be split any further
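
The covering loop can be sketched as follows. This is a deliberately simplified illustration, not RIPPER or JRip: every rule is a single test `feature >= threshold`, there is no pruning, and the names `learn_one_rule`, `sequential_covering`, `min_accuracy` and the records are all assumptions.

```python
def learn_one_rule(rows, features, label):
    """Pick the single test `feature >= threshold` with the highest accuracy for label == 1."""
    best = None
    for feature in features:
        for threshold in sorted({r[feature] for r in rows}):
            covered = [r for r in rows if r[feature] >= threshold]
            positives = sum(r[label] for r in covered)
            # Prefer higher accuracy, then higher coverage.
            key = (positives / len(covered), len(covered))
            if positives and (best is None or key > best[0]):
                best = (key, feature, threshold)
    return best


def sequential_covering(rows, features, label, min_accuracy=0.8):
    """Learn rules one at a time, removing the records each new rule covers."""
    rules = []
    remaining = list(rows)
    while any(r[label] == 1 for r in remaining):
        best = learn_one_rule(remaining, features, label)
        if best is None or best[0][0] < min_accuracy:
            break  # stop condition: no sufficiently accurate rule is left
        _, feature, threshold = best
        rules.append((feature, threshold))
        remaining = [r for r in remaining if r[feature] < threshold]
    return rules


# Made-up records, for illustration only.
rows = [
    {"CPU": 95, "Restarts": 0, "anomaly": 1},
    {"CPU": 92, "Restarts": 0, "anomaly": 1},
    {"CPU": 30, "Restarts": 3, "anomaly": 1},
    {"CPU": 40, "Restarts": 0, "anomaly": 0},
    {"CPU": 50, "Restarts": 1, "anomaly": 0},
]

rules = sequential_covering(rows, ["CPU", "Restarts"], "anomaly")
```

On this toy data the loop first learns `CPU >= 92`, removes the two records it covers, and then learns `Restarts >= 3` for the remaining positive record. Records matched by no rule fall through to the default class 0, which gives a decision list.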

RIPPER (Repeated Incremental Pruning to Produce Error Reduction) is a variant of the Sequential Covering algorithm. RIPPER is a bit more sophisticated and uses a post-processing phase (rule pruning) to optimize the decision list (or set). RIPPER can run in ordered or unordered mode and generate either a decision list or decision set.

3.1.4 Run Model

First we export the data to a CSV file and load this file in the Weka application.

In [37]:
data.to_csv('anomaly_detection.csv',index=False)

We use 80% of the data for training and the remaining 20% for testing. We train a separate classifier for each check; the results are:

  • Check 1:

    • Number of Rules : 13
    • Select 5 best examples:

      • (CPU >= 90.01) and (DiagnosisFiles <= 187) and (DiagnosisFiles >= 97) and (InstanceMEM <= 73.02) => Check1=1 (242.0/0.0)
      • (CPU >= 90.01) and (Dumps <= 0) and (Disk >= 71.19) and (BlockingPhaseSec <= 1570) => Check1=1 (485.0/5.0)
      • (CPU >= 90.01) and (DiagnosisFilesSize >= 1342487431) and (LargestTableSize >= 173440246) => Check1=1 (118.0/3.0)
      • (CPU >= 90.01) and (DiagnosisFilesSize <= 1299958138) and (BlockingPhaseSec >= 18) and (MinDailyNumberOfSuccessfulLogBackups <= 123) => Check1=1 (93.0/2.0)
      • => Check1=0 (285325.0/3.0)

        *The numbers in brackets are the number of instances covered by the rule and the number of those instances that the rule misclassifies.

    • Time taken to build model: 137.03 seconds
    • Correctly Classified Instances: 57384 - 99.9791 %
    • Incorrectly Classified Instances: 12 - 0.0209 %
    • Root mean squared error: 0.0146
  • Check 2:
    • Number of Rules : 21
    • Select 5 best examples:
      • (InstanceMEM >= 90) and (Check4 = 1) and (PhysMEM <= 94.67) and (HighPriorityAlerts <= 6) and (SystemID <= 1779) => Check2=1 (4360.0/5.0)
      • (InstanceMEM >= 90) and (Check4 = 1) and (MinDailyNumberOfSuccessfulLogBackups <= 230) and (DiagnosisFiles >= 189) => Check2=1 (1410.0/2.0)
      • (InstanceMEM >= 90) and (Check4 = 1) and (DiagnosisFiles <= 187) => Check2=1 (1969.0/14.0)
      • (InstanceMEM >= 90.01) and (Check4 = 1) and (DiagnosisFilesSize >= 1306483280) => Check2=1 (264.0/0.0)
      • => Check2=0 (278146.0/8.0)
    • Time taken to build model: 322.76 seconds
    • Correctly Classified Instances: 57357 - 99.9321 %
    • Incorrectly Classified Instances: 39 - 0.0679 %
    • Root mean squared error: 0.0254
  • Check 3:
    • Number of Rules : 38
    • Select 5 best examples:
      • (PhysMEM >= 95) and (MinDailyNumberOfSuccessfulLogBackups <= 161) and (DiagnosisFilesSize >= 1309255655) and (LargestPartitionSize >= 248288896) => Check3=1 (1306.0/11.0)
      • (PhysMEM >= 95) and (DiagnosisFilesSize <= 1302828960) and (DiagnosisFilesSize >= 556677376) and (IndexServerAllocationLimit >= 69.37) => Check3=1 (1921.0/7.0)
      • (PhysMEM >= 95) and (DiagnosisFilesSize <= 556424582) and (SystemID >= 316) and (LogSegmentChange >= -1) and (BlockingPhaseSec >= 11) => Check3=1 (871.0/1.0)
      • (PhysMEM >= 95) and (DiagnosisFilesSize >= 1305327632) and (DeltaSize >= 795176703) and (MinDailyNumberOfSuccessfulLogBackups >= 129) => Check3=1 (685.0/3.0)
      • => Check3=0 (278803.0/46.0)
    • Time taken to build model: 899.03 seconds
    • Correctly Classified Instances: 57346 - 99.9129 %
    • Incorrectly Classified Instances: 50 - 0.0871 %
    • Root mean squared error: 0.0282
  • Check 4:
    • Number of Rules : 30
    • Select 5 best examples:
      • (IndexServerAllocationLimit >= 90) and (PhysMEM <= 94.54) and (DiagnosisFilesSize >= 1304499900) => Check4=1 (9227.0/24.0)
      • (IndexServerAllocationLimit >= 90) and (DiagnosisFilesSize <= 1303836128) and (DiagnosisFiles >= 92) => Check4=1 (10267.0/32.0)
      • (IndexServerAllocationLimit >= 90.36) and (LargestTableSize <= 718377930) => Check4=1 (1394.0/0.0)
      • (IndexServerAllocationLimit >= 90) and (DiagnosisFiles <= 90) and (BlockingPhaseSec >= 36) and (InstanceMEM >= 37.59) => Check4=1 (1036.0/1.0)
      • => Check4=0 (262092.0/74.0)
    • Time taken to build model: 525.73 seconds
    • Correctly Classified Instances: 57339 - 99.9007 %
    • Incorrectly Classified Instances: 57 - 0.0993 %
    • Root mean squared error: 0.0307
  • Check 5:
    • Number of Rules : 18
    • Select 5 best examples:
      • (TablesAllocation >= 70) and (Check4 = 1) and (PhysMEM <= 94.51) and (DeltaSize <= 14295859011) => Check5=1 (1654.0/6.0)
      • (TablesAllocation >= 70.01) and (Check4 = 1) and (Dumps <= 6) => Check5=1 (797.0/12.0)
      • (TablesAllocation >= 70.01) and (IndexServerAllocationLimit <= 89.99) and (DiagnosisFiles <= 184) => Check5=1 (223.0/7.0)
      • (TablesAllocation >= 70.01) and (DiagnosisFiles >= 190) => Check5=1 (163.0/1.0)
      • => Check5=0 (283867.0/23.0)
    • Time taken to build model: 303.88 seconds
    • Correctly Classified Instances: 57366 - 99.9477 %
    • Incorrectly Classified Instances: 30 - 0.0523 %
    • Root mean squared error: 0.0216
  • Check 6:
    • Number of Rules : 41
    • Select 5 best examples:
      • (DiagnosisFiles >= 189) => Check6=1 (46919.0/1.0)
      • (DiagnosisFiles >= 151) and (DiagnosisFiles <= 187) => Check6=1 (25051.0/1.0)
      • (DiagnosisFiles >= 188) and (Check3 = 0) and (Check7 = 0) and (Check4 = 0) and (Check8 = 0) => Check6=1 (6300.0/5.0)
      • (DiagnosisFiles >= 188) and (Dumps >= 1) and (DeltaSize <= 23730303563) and (LargestTableSize >= 797170066) and (SystemID <= 98) => Check6=1 (135.0/2.0)
      • => Check6=0 (206325.0/46.0)
    • Time taken to build model: 581.47 seconds
    • Correctly Classified Instances: 57323 - 99.8728 %
    • Incorrectly Classified Instances: 73 - 0.1272 %
    • Root mean squared error: 0.0333
  • Check 7:
    • Number of Rules : 40
    • Select 5 best examples:
      • (BlockingPhaseSec >= 101) and (LogSegmentChange >= 117) => Check7=1 (1067.0/65.0)
      • (BlockingPhaseSec >= 22) and (LogSegmentChange >= 104) and (PhysMEM <= 70.57) => Check7=1 (125.0/13.0)
      • (LogSegmentChange >= 68) and (IndexServerAllocationLimit >= 54.48) => Check7=1 (550.0/108.0)
      • (LogSegmentChange >= 70) and (BlockingPhaseSec >= 12) and (MinDailyNumberOfSuccessfulLogBackups >= 8) and (DeltaSize <= 7931758667) and (DiagnosisFiles <= 166) => Check7=1 (100.0/8.0)
      • => Check7=0 (281470.0/3347.0)
    • Time taken to build model: 724.97 seconds
    • Correctly Classified Instances: 56418 - 98.296 %
    • Incorrectly Classified Instances: 978 - 1.704 %
    • Root mean squared error: 0.1239
  • Check 8:
    • Number of Rules : 5
    • Select 5 best examples:
      • (IndexServerRestarts >= 3) => Check8=1 (1792.0/0.0)
      • (NameServerRestarts >= 3) => Check8=1 (826.0/0.0)
      • (NameServerRestarts >= 1) and (IndexServerRestarts >= 2) => Check8=1 (216.0/0.0)
      • (NameServerRestarts >= 2) and (IndexServerRestarts >= 1) => Check8=1 (85.0/0.0)
      • => Check8=0 (284060.0/0.0)
    • Time taken to build model: 39.26 seconds
    • Correctly Classified Instances: 57396 - 100 %
    • Incorrectly Classified Instances: 0 - 0 %
    • Root mean squared error: 0
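
To show how such a decision list is applied, the sketch below transcribes the five Check 8 rules listed above into a plain Python function. The function name `predict_check8` and the use of a dictionary per record are assumptions; the thresholds are copied from the rule list.

```python
def predict_check8(record):
    """Apply the Check 8 rules in order; the first rule that matches decides."""
    index_restarts = record["IndexServerRestarts"]
    name_restarts = record["NameServerRestarts"]
    if index_restarts >= 3:
        return 1
    if name_restarts >= 3:
        return 1
    if name_restarts >= 1 and index_restarts >= 2:
        return 1
    if name_restarts >= 2 and index_restarts >= 1:
        return 1
    return 0  # default rule: => Check8=0


# Compare the rule list with one summed test on small non-negative counts.
equivalent_to_sum = all(
    predict_check8({"IndexServerRestarts": i, "NameServerRestarts": n})
    == int(i + n >= 3)
    for i in range(6)
    for n in range(6)
)
```

For non-negative integer counts, the four positive rules together amount to the single test `IndexServerRestarts + NameServerRestarts >= 3`, which is what the comparison at the end checks.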

The results are very good: the share of correctly classified instances ranges from 98.296% (Check 7) to 100% (Check 8), and the root mean squared error never exceeds 0.1239. Building a model takes between 39 and 899 seconds. Decision rules that correspond to a decision tree produce exactly the same predictions as the tree, but rule sets can be more perspicuous. Based on the rule lists, we can understand which features are relevant to each error:

  • Check 1: This error is mainly related to high CPU usage, with additional factors such as DiagnosisFiles and BlockingPhaseSec.
  • Check 2: This error is mainly related to high InstanceMEM. As noted above, there is a moderate correlation between Check4 and Check2.
  • Check 3: This error is mainly related to high PhysMEM.
  • Check 4: This error is mainly related to IndexServerAllocationLimit, together with DiagnosisFiles and DiagnosisFilesSize.
  • Check 5: This error is mainly related to TablesAllocation.
  • Check 6: This error is mainly related to DiagnosisFiles.
  • Check 7: This error is mainly related to BlockingPhaseSec and LogSegmentChange.
  • Check 8: This error is related to server restarts (IndexServerRestarts and NameServerRestarts).

3.1.5 Parameter Optimisation

We can tune the following JRip parameters to obtain a better model:

  • F number: The number of folds for reduced error pruning. One fold is used as the pruning set. (Default: 3)
  • N number: The minimal weights of instances within a split. (Default: 2)
  • O number: The number of optimization runs. (Default: 2)
  • D: Whether to turn on debug mode.
  • S number: The seed of randomization used in RIPPER. (Default: 1)
  • E: Whether NOT to check for an error rate >= 0.5 in the stopping criteria. (Default: check)
  • P: Whether NOT to use pruning. (Default: use pruning)
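
One way to organise such a search is to enumerate option combinations and hand each one to Weka. The sketch below only builds the option lists. The helper name `jrip_options` and the candidate values are assumptions, and how the options are passed to JRip (Weka GUI, command line or a wrapper) is left open.

```python
from itertools import product


def jrip_options(folds=3, min_weight=2.0, optimizations=2, seed=1, pruning=True):
    """Build an option list using the flags described above (defaults as listed)."""
    options = [
        "-F", str(folds),
        "-N", str(min_weight),
        "-O", str(optimizations),
        "-S", str(seed),
    ]
    if not pruning:
        options.append("-P")  # -P switches pruning off
    return options


# Hypothetical candidate values; the ranges worth trying depend on the data.
grid = [
    jrip_options(folds=f, optimizations=o)
    for f, o in product([3, 5], [2, 4])
]
```

Each entry of `grid` is one configuration to train and evaluate on a validation split; the configuration with the best validation score would then be retrained and scored once on the test split.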

4. Parameter Optimisation

Irrespective of your choice, it is highly likely that your model will have one or more parameters that require tuning. There are several techniques for carrying out such a procedure, including cross-validation, Bayesian optimisation, and several others. As before, an analysis into which parameter tuning technique best suits your model is expected before proceeding with the optimisation of your model.

5. Model Evaluation

Some form of pre-evaluation will inevitably be required in the preceding sections in order to both select an appropriate model and configure its parameters appropriately. In this final section, you may evaluate other aspects of the model such as:

  • Assessing the running time of your model;
  • Determining whether some aspects can be parallelised;
  • Training the model with smaller subsets of the data.
  • etc.

For the evaluation of the classification results, you should use F1-score for each class and do the average.
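
A minimal sketch of this metric, assuming the true and predicted labels are stored as one list of 0/1 values per check, is given below. The helper names and the toy labels are illustrative. The convention of scoring a label with no positives and no positive predictions as 0 is also an assumption.

```python
def f1_binary(y_true, y_pred):
    """F1-score of the positive class for one binary label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(true_by_label, pred_by_label):
    """Return the per-label F1-scores and their unweighted average."""
    scores = {
        name: f1_binary(true_by_label[name], pred_by_label[name])
        for name in true_by_label
    }
    return scores, sum(scores.values()) / len(scores)


# Toy labels, for illustration only.
y_true = {"Check1": [1, 1, 0, 0], "Check2": [0, 1, 1, 0]}
y_pred = {"Check1": [1, 0, 0, 0], "Check2": [0, 1, 1, 0]}

scores, average = macro_f1(y_true, y_pred)
```

Here Check1 scores 2/3, Check2 scores 1.0, and the average is 5/6. With scikit-learn, `f1_score(..., average='macro')` on a label-indicator matrix computes the same kind of unweighted per-label average.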

N.B. Please note that you are responsible for creating a sensible train/validation/test split. There is no predefined held-out test data.
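
One simple way to obtain such a split is a seeded random shuffle of the row indices, as sketched below. The 60/20/20 proportions, the seed and the function name `split_indices` are assumptions. A purely random split also ignores any grouping by system or date, which may or may not be appropriate for this data.

```python
import random


def split_indices(n, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the indices 0..n-1 and cut them into train/validation/test parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split reproducible
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(100)
```

The three index lists are disjoint and together cover every row, so they can be used to select rows from the data without overlap between the parts.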

*. Optional

As you will see in the dataset description, the labels you are going to predict have no meaningful names. Try to understand which kind of anomalies these labels refer to and give sensible names. To do it, you could exploit the output of the interpretable models and/or use a statistical approach with the data you have.

N.B. Please note that the items listed under each heading are neither exhaustive, nor are you expected to explore every given suggestion. Nonetheless, these should serve as a guideline for your work in both this and upcoming challenges. As always, you should use your intuition and understanding in order to decide which analysis best suits the assigned task.

Submission Instructions


  • The goal of this challenge is to construct one or more models to detect anomalies.
  • Your submission will be the HTML version of your notebook exploring the various modelling aspects described above.
